Abstract

Skylines emerged as a useful notion in database queries for selecting representative
groups in multivariate data samples for further decision making, multi-objective
optimization or data processing, and the $k$-dominant skylines were naturally introduced to
resolve the abundance of skylines when the dimensionality grows or when the coordinates are
negatively correlated. We prove in this paper that the expected number of $k$-dominant
skylines is asymptotically zero for large samples when $1 \le k \le d-1$ under two reasonable
(continuous) probability assumptions of the input points, $d$ being the (finite)
dimensionality, in contrast to the asymptotic unboundedness when $k = d$. In addition to such
an asymptotic zero-infinity property, we also establish a sharp threshold phenomenon for the
expected number of $(d-1)$-dominant skylines when the dimensionality is allowed to grow with
$n$. Several related issues, such as the dominant cycle structures and numerical aspects, are
also briefly studied.



Key words. Skyline, dominance, maxima, random samples, Pareto optimality, threshold phe- 
nomena, multi-objective optimization, computational geometry, asymptotic approximations, 
average-case analysis of algorithms. 



1 Introduction 

The last decade has seen a drastic change in information dissemination from Web 1.0 to
Web 2.0, the most notable representative products being YouTube and Facebook. Data have
been generated at an unprecedented pace and range, powerful search engines are
indispensable, and screening useful or usable information (via "sort engines") from the vast
amount available is generally becoming more important than searching and gathering. Skylines
of multivariate data samples were introduced for selecting representative groups in the
database query literature by Börzsönyi et al. (see [7]) and had appeared in diverse areas
under several different guises and names: Pareto optimality, efficiency, maxima,
admissibility, elite, sink, etc.; see [11, 12] and the references therein for more
information. These diverse terms reveal the importance of the skyline as an effective means
of data summarization in theory and in practice. Many different notions and variants of
skylines have been proposed in the literature, following the original paper [7]. In
particular, the $k$-dominant skylines were introduced by Chan et al. (see [9]) for situations
where the skylines are abundant, and have received much attention since, although they had
already been studied in the Russian literature (see for example [3, 23]). We focus in this
paper on the asymptotic estimates of such skylines and prove several types of threshold
phenomena under different probability assumptions on the input samples, which, in addition to
their theoretical interest, are believed to be useful for practitioners.

Skylines and $k$-dominant skylines. The definitions of skyline and many of its variants are
based on the notion of dominance. Given a $d$-dimensional dataset $\mathcal{D}$, a point
$\mathbf{p} \in \mathcal{D}$ is said to dominate another point $\mathbf{q} \in \mathcal{D}$
if $p_j \le q_j$ for $1 \le j \le d$, where $\mathbf{p} = (p_1, \dots, p_d)$ and
$\mathbf{q} = (q_1, \dots, q_d)$, and $p_j < q_j$ in at least one dimension. The
non-dominated points in $\mathcal{D}$ are called the skyline (or skyline points) of
$\mathcal{D}$. By relaxing the full dominance definition to partial dominance, we say that a
point $\mathbf{p} \in \mathcal{D}$ $k$-dominates another point $\mathbf{q} \in \mathcal{D}$
if there are $k$ dimensions in which $p_j$ is not greater than $q_j$ and is less than $q_j$
in at least one of these $k$ dimensions¹. The points in $\mathcal{D}$ that are not
$k$-dominated by any other points are defined to be the $k$-dominant skyline of
$\mathcal{D}$; see [9]. See also [3] for a different formulation.
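
The two definitions translate directly into a brute-force check. The following Python sketch is illustrative only: the function names `k_dominates` and `k_dominant_skyline` are ours, smaller coordinates are taken to be better as in the definition, and the sample data (five cyclic shifts of (1,2,2,3,3)) are chosen for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def k_dominates(p: Sequence[float], q: Sequence[float], k: int) -> bool:
    """Return True if p k-dominates q (smaller coordinates are better).

    p k-dominates q when p_j <= q_j holds in at least k coordinates and
    p_j < q_j holds in at least one coordinate (which is then automatically
    one of the coordinates counted in the first condition).
    """
    not_greater = sum(1 for a, b in zip(p, q) if a <= b)
    strictly_less = any(a < b for a, b in zip(p, q))
    return not_greater >= k and strictly_less


def k_dominant_skyline(points: List[Point], k: int) -> List[Point]:
    """Points that are not k-dominated by any other point (quadratic scan)."""
    return [
        q
        for i, q in enumerate(points)
        if not any(k_dominates(p, q, k) for j, p in enumerate(points) if j != i)
    ]


# Illustrative data: five cyclic shifts of (1, 2, 2, 3, 3) in dimension 5.
cyclic = [(1, 2, 2, 3, 3), (3, 1, 2, 2, 3), (3, 3, 1, 2, 2),
          (2, 3, 3, 1, 2), (2, 2, 3, 3, 1)]
skyline = k_dominant_skyline(cyclic, 5)        # k = d: the ordinary skyline
four_dominant = k_dominant_skyline(cyclic, 4)  # partial dominance with k = 4
```

With $k = d$ the function returns the ordinary skyline. For these five points every point is a skyline point, while the 4-dominant skyline is empty because each point is 4-dominated by its cyclic successor.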

The definition of $k$-dominant skyline implies that for a fixed dataset the number of
$k$-dominant skyline points does not increase as $k$ becomes smaller. Such a monotonicity
property will be used later. To see this, consider any point $\mathbf{p}$ in the unit square.
It is a skyline (or 2-dominant skyline) point if no other point has simultaneously a smaller
$x$-value and a smaller $y$-value; namely, no other point can lie in the quadrant to the
lower left of $\mathbf{p}$ (the shaded region of the inline figure, in which $\mathbf{p}$ is
the dotted point in the middle). However, to be a 1-dominant skyline point requires that all
other points have simultaneously a larger $x$-value and a larger $y$-value, or, equivalently,
that they cannot lie in the complement of the quadrant to the upper right of $\mathbf{p}$
(the shaded region of a second inline figure).

On the other hand, the transitivity property of dominance fails for $k$-dominance when
$1 \le k \le d-1$, meaning that the $k$-dominant skyline may be empty and there may be cycles.

The number of skyline points. The number of skyline points is a key issue in their use and
usefulness. This quantity under suitable random assumptions on the input is also important
for practical modeling or reference purposes, as well as for the analysis of skyline-finding
algorithms. The two major, simple, representative random models are hypercubes and
simplices. Assuming that the input dataset $\mathcal{D} = \{\mathbf{p}_1, \dots, \mathbf{p}_n\}$
is taken uniformly and independently from the hypercube $[0,1]^d$, it has been known since
the 1960s (see [1]) that the expected number of skyline points of $\mathcal{D}$ is asymptotic
to $\frac{(\log n)^{d-1}}{(d-1)!}$ for large $n$ and finite $d$, exhibiting the independence
of the coordinates. (Intuitively, if one sorts according to one dimension, then each other
dimension roughly contributes $\log n$ skyline points.) On the other hand, if we assume that
the input points are uniformly sampled from the $d$-dimensional simplex
$\{\mathbf{x} : |x_1| + \cdots + |x_d| \le 1,\ x_j \in (-1, 0]\}$, then the expected number of
skyline points is asymptotic to $\Gamma\left(\frac1d\right) n^{1-\frac1d}$, reflecting a
stronger negative correlation of the coordinates; see [5] and the references cited there.
Here $\Gamma$ denotes Euler's Gamma function. For the number of skyline points under other
models, see [2, 14, 15, 25] and the references therein.

¹ If we change the definition of the $k$-dominant skyline to require "exactly $k$" (instead
of $\ge k$) coordinates smaller than or equal to, with at least one strictly smaller, then
the same types of results in this paper also hold.

On the other hand, in contrast to the recent growing trend of studying high-dimensional
datasets, not much is known about the expected number of skyline points when $d$ is allowed
to grow with $n$. Such a direction is especially useful as practical situations always deal
with finite $n$ and finite $d$ (whose dependence on $n$ is often not clear). The only
exception along this direction is the uniform estimates given in [18] (see also [5]) for the
expected number of skyline points in a random uniform sample of $n$ points from the hypercube
$[0,1]^d$. While the order $\frac{(\log n)^{d-1}}{(d-1)!}$ may seem to grow slowly as $d$
increases, it soon reaches the order $n$ when $d$ is around $\log n$, which is relatively
small for moderate values of $n$. Consequently, the skyline points become too numerous to be
of direct use. The growth of skyline points in the random $d$-dimensional simplex model is
even faster, and we can show that almost all points are skylines when $d$ roughly exceeds
$\frac{\log n}{\log\log n}$, again small for $n$ not too large.
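
To get a feeling for how quickly the skyline grows with $d$ in the hypercube model, one can compare the first-order approximation $(\log n)^{d-1}/(d-1)!$ with the classical exact expression $\sum_{1\le i_1\le\cdots\le i_{d-1}\le n} 1/(i_1\cdots i_{d-1})$ for the expected number of maxima of $n$ independent uniform points. That exact formula is quoted here as an assumption from the literature on maxima, not from the text above, and the function names are illustrative.

```python
from math import factorial, log


def expected_skyline_hypercube(n: int, d: int) -> float:
    """Exact mean number of skyline points of n uniform points in [0,1]^d,
    computed from the nested harmonic sum by dynamic programming."""
    # level[m] holds the sum over 1 <= i_1 <= ... <= i_r <= m of 1/(i_1...i_r);
    # for r = 0 the empty product gives 1.
    level = [1.0] * (n + 1)
    for _ in range(d - 1):
        nxt = [0.0] * (n + 1)
        for m in range(1, n + 1):
            nxt[m] = nxt[m - 1] + level[m] / m
        level = nxt
    return level[n]


def first_order_approximation(n: int, d: int) -> float:
    """The asymptotic form (log n)^(d-1) / (d-1)! for fixed d."""
    return log(n) ** (d - 1) / factorial(d - 1)


if __name__ == "__main__":
    n = 1000
    for d in (2, 3, 5, 8):
        print(d, round(expected_skyline_hypercube(n, d), 2),
              round(first_order_approximation(n, d), 2))
```

For $d = 2$ the exact value is the harmonic number $H_n$. The exact value can never exceed $n$, whereas the first-order form is only meaningful for $d$ small compared with $\log n$.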

The cardinality of $k$-dominant skyline. Since $k$-dominant skylines were proposed (see [9])
to resolve the skyline-abundance problem, it is of interest to know their quantity under
suitable random models. A critical step in applying $k$-dominant skylines is to identify an
appropriate $k$ such that the size of the $k$-dominant skyline is within an acceptable range.
But this may not always be feasible. Consider the 5-dimensional dataset $\mathcal{D}$ given
in Table 1. The six points are all skyline points, one ($\mathbf{p}_6$) is the 4-dominant
skyline point and no point is in the 3-dominant skyline. Clearly, $\mathbf{p}_6$ is to some
extent better than the other points since it contains two components with the lowest value 1.
However, it was already mentioned in [9] that some $k$-dominant skylines may be empty. For
example, if we drop $\mathbf{p}_6$ from $\mathcal{D}$, then the five points are all skyline
points but all $k$-dominant skylines are empty for $1 \le k \le 4$. In this example, other
alternatives to $k$-dominant skylines have to be used. Unfortunately, such a property of
excessive skylines but few $k$-dominant skylines is not uncommon, and we show in this paper
that, under both the hypercube and the simplex random models, the expected number of
$k$-dominant skylines tends to zero for large $n$ and $1 \le k \le d-1$.



point              skyline   4-dominant skyline   3-dominant skyline
p1 (1,2,2,3,3)        ✓
p2 (3,1,2,2,3)        ✓
p3 (3,3,1,2,2)        ✓
p4 (2,3,3,1,2)        ✓
p5 (2,2,3,3,1)        ✓
p6 (2,3,1,1,3)        ✓              ✓

Table 1: An example showing the property of many skylines but few k-dominant skylines.

Threshold phenomena. We clarify two types of threshold phenomena for the expected number of
$k$-dominant skylines in random samples.

1. Large sample, bounded dimension:

\[
  \text{Expected number of $k$-dominant skylines} \to
  \begin{cases} 0, & \text{if } 1 \le k \le d-1; \\ \infty, & \text{if } k = d, \end{cases}
\]

as the sample size $n \to \infty$. While such a result is not new and is contained as a
special case of the general theory developed in [3] for finite-dimensional skylines, we will
give an independent, transparent, self-contained proof, which, in addition to being more
precise, can be extended to the case when the dimensionality grows unbounded with the sample
size.

2. Large sample, moderate dimension: There exists an integer
$d_0 = d_0(n) \approx \sqrt{\frac{2\log n}{\log\log n}} + 1$ such that (see (23))

\[
  \text{Expected number of $(d-1)$-dominant skylines} \to
  \begin{cases} 0, & \text{if } d < d_0; \\ \infty, & \text{if } d > d_0 + 1, \end{cases}
\]

as $n \to \infty$, and the two cases $d = d_0$ and $d = d_0 + 1$ lead to two different
oscillating functions, whose ranges of fluctuation involve Euler's constant $\gamma$; see
(24) and (25). We consider only random samples from hypercubes. Other regions and other
values of $k$, $k < d-1$, are expected to exhibit similar threshold phenomena with different
$d_0$, but the analysis becomes excessively long and involved. More details will be discussed
elsewhere.

We see from these phenomena that the usual "curse of high dimensionality" has another form
here, which one may term the "curse of constant dimensionality," referring to the situation
when no $k$-dominant skyline point exists at all. Also, the model where the dimensionality
can vary with the sample size is, at least from a practical point of view, more reasonable;
see Sections 6 and 7 for more discussions and details.

Related works. In addition to the partial dominance used in defining $k$-dominant skylines
(see [9]), there are also several other skyline variants for retrieving more representative
points; these include skybands [24], top-$k$ dominating queries [20, 24, 27], strong skylines
[28], skyline frequency [10], approximately dominating representatives [21],
$\varepsilon$-skylines [26], and top-$k$ skylines [8, 22]. See also the survey paper [20] for
more information.

Organization of the paper. This paper presents a systematic study of the asymptotic
estimates of the number of $k$-dominant skyline points under random models. It is organized
as follows. We derive in the next section (§ 2) an asymptotic vanishing property for the
number of $k$-dominant skyline points under a common hypercube model when the dimensionality
is bounded. The extension to include more points in the partial dominant skyline is shown to
suffer from a similar drawback in Section 3. We then prove in Section 4 that changing the
underlying model from hypercube to simplex does not remove the asymptotic vanishing property
either. Section 5 deals with a categorical model for which the results have a very different
nature. Roughly, as the total number of possible sample points is finite in this model, the
expected number of $k$-dominant skylines will be asymptotically linear, meaning too many
choices for ranking or selection purposes. All these results point to the negative side of
the use of $k$-dominant skylines under similar data situations. We then address the positive
side in the last few sections by considering again the hypercubes but with growing
dimensionality. A sharp threshold phenomenon is discovered in Section 7 when $d \to \infty$
with $n$, the asymptotic approximations needed being derived in Section 6. Another new
threshold result is given in Section 8 for the expected number of dominant cycles. Section 9
provides a uniform lower-bound estimate for the expected number of skyline points for
$1 \le k \le d-1$. We conclude in Section 10 with some numerical aspects of the estimates we
derived.

2 Random samples from hypercubes 

The simplest random model is the hypercube $[0,1]^d$, which is also the most natural and most
studied one. It can also be used when data are discrete in nature but spread uniformly over a
sufficiently large interval.

In this section, we derive asymptotic estimates for the expected number of $k$-dominant
skyline points in a random sample of $n$ points
$\mathcal{D} := \{\mathbf{p}_1, \dots, \mathbf{p}_n\}$ uniformly and independently drawn from
$[0,1]^d$, $d \ge 2$. Let $M_{d,k}(n)$ denote the number of $k$-dominant skyline points of
$\mathcal{D}$. We first derive a crude upper bound for the expected number
$\mathbb{E}[M_{d,k}(n)]$, which implies that $\mathbb{E}[M_{d,k}(n)]$ is asymptotically zero
as $n$ grows unbounded and $1 \le k \le d-1$. More precise estimates are possible and will be
derived in Section 6. For a point $\mathbf{p} \in [0,1]^d$, denote by $B_k(\mathbf{p})$ the
region of the points in $[0,1]^d$ that $k$-dominate $\mathbf{p}$. Also, $|A|$ denotes the
volume of the region $A$.

Theorem 1 (Asymptotic zero-infinity property for large $n$ and bounded $d$). For fixed
$d \ge 2$,

\[
  \mathbb{E}[M_{d,k}(n)] \to
  \begin{cases} 0, & \text{if } 1 \le k \le d-1; \\ \infty, & \text{if } k = d, \end{cases}
  \tag{1}
\]

as $n \to \infty$.

Proof. The case $k = d$ has been known since the 1960s (see [1]) and was re-derived several
times in the literature. We assume $1 \le k \le d-1$. Since $M_{d,k}(n) \le M_{d,d-1}(n)$ for
fixed $d$ and for $1 \le k \le d-1$, we only prove that $\mathbb{E}[M_{d,d-1}(n)] \to 0$.

We start from the integral representation

\[
  \mathbb{E}[M_{d,d-1}(n)]
  = n\,\mathbb{P}(\mathbf{p}_1 \text{ is a $(d-1)$-dominant skyline point})
  = n \int_{[0,1]^d} \bigl(1 - |B_{d-1}(\mathbf{x})|\bigr)^{n-1} \,\mathrm{d}\mathbf{x},
  \tag{2}
\]

because if $\mathbf{x}$ is not $k$-dominated by any of the other $n-1$ points, they all have
to lie in the region $[0,1]^d \setminus B_k(\mathbf{x})$. Here and throughout this paper,
$\mathrm{d}\mathbf{x}$ is the abbreviation of $\mathrm{d}x_1 \cdots \mathrm{d}x_d$.

To estimate the integral in (2), we split it into two parts, one part having sufficiently
small volume (corresponding roughly to small $x_1 \cdots x_d$) and the other with
$|B_{d-1}(\mathbf{x})|$ bounded away from zero, rendering the term
$(1 - |B_{d-1}(\mathbf{x})|)^{n-1}$ also small.

For a fixed number $t$ satisfying $1 < t < \frac{d}{d-1}$, define the region

\[
  Q_n := \bigcup_{1 \le \ell \le d} \Bigl\{ \mathbf{x} \in [0,1]^d :
  x_\ell < n^{-\frac{t}{d}} \text{ and } \prod_{j \ne \ell} x_j < n^{-\frac{d-1}{d}t} \Bigr\}.
  \tag{3}
\]


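Then

\[
  \mathbb{E}[M_{d,d-1}(n)] \le n|Q_n|
  + n \int_{[0,1]^d \setminus Q_n} \bigl(1 - |B_{d-1}(\mathbf{x})|\bigr)^{n-1} \,\mathrm{d}\mathbf{x}.
\]

The volume of $Q_n$ is bounded above by

\[
  |Q_n| \le d\, n^{-\frac{t}{d}}
  \int_{\substack{\mathbf{x} \in [0,1]^{d-1} \\ x_1 \cdots x_{d-1} < n^{-\frac{d-1}{d}t}}} \mathrm{d}\mathbf{x}.
\]

To estimate the last integral, let

\[
  A_d(\delta) := \int_{\substack{\mathbf{x} \in [0,1]^{d-1} \\ x_1 \cdots x_{d-1} < \delta}}
  \mathrm{d}\mathbf{x} \qquad (d \ge 2),
\]

where $0 < \delta < 1$. Then $A_2(\delta) = \delta$, and

\[
  A_d(\delta) = \delta + \int_\delta^1 A_{d-1}\Bigl(\frac{\delta}{u}\Bigr) \,\mathrm{d}u \qquad (d \ge 3).
\]

A simple induction gives

\[
  A_d(\delta) = \delta \sum_{0 \le j \le d-2} \frac{|\log \delta|^j}{j!}
  = O\bigl(\delta |\log \delta|^{d-2}\bigr) \qquad (d \ge 2),
\]

and we obtain, by taking $\delta = n^{-\frac{d-1}{d}t}$,

\[
  |Q_n| = O\bigl(n^{-t} (\log n)^{d-2}\bigr).
\]

On the other hand, by an inclusion-exclusion argument, we have

\[
  |B_{d-1}(\mathbf{x})| = \sum_{1 \le \ell \le d} \prod_{j \ne \ell} x_j
  - (d-1) \prod_{1 \le j \le d} x_j.
  \tag{4}
\]

Now if $\mathbf{x} \in [0,1]^d \setminus Q_n$, then

\[
  |B_{d-1}(\mathbf{x})| \ge \max_{1 \le \ell \le d} \prod_{j \ne \ell} x_j \ge n^{-\frac{d-1}{d}t}.
\]

Thus, we have

\[
  \mathbb{E}[M_{d,d-1}(n)] = O\bigl(n^{1-t}(\log n)^{d-2}\bigr)
  + O\Bigl(n \exp\bigl(-(n-1)\, n^{-\frac{d-1}{d}t}\bigr)\Bigr),
  \tag{5}
\]

and we see easily that the right-hand side tends to zero by our choice of $t$. More
precisely, if we take

\[
  t = \frac{d}{d-1}\Bigl(1 - \frac{\log\bigl(\frac{d}{d-1}\log n\bigr)}{\log n}\Bigr)
\]

so as to balance the two $O$-terms in (5), then

\[
  \mathbb{E}[M_{d,d-1}(n)] = O\Bigl(n^{-\frac{1}{d-1}} (\log n)^{d-2+\frac{d}{d-1}}\Bigr).
\]

This and the monotonicity of $M_{d,k}(n)$ (in $k$) prove (1). ■

The fact that $\mathbb{E}[M_{d,k}(n)] \to 0$ implies that there are many cycles formed by the
$k$-dominant relation, but the corresponding cycle structures are very difficult to quantify;
see Section 10 for some preliminary results.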
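
Theorem 1 can be probed empirically with a small Monte Carlo experiment. The sketch below is an illustration under our own assumptions (sample sizes, seed, number of trials, and function names are arbitrary choices, and a quadratic-time scan is used); it is not a reproduction of any computation in the paper.

```python
import random
from typing import List, Sequence


def k_dominates(p: Sequence[float], q: Sequence[float], k: int) -> bool:
    """p k-dominates q: p_j <= q_j in at least k coordinates, one of them strict."""
    not_greater = sum(1 for a, b in zip(p, q) if a <= b)
    return not_greater >= k and any(a < b for a, b in zip(p, q))


def skyline_size(points: List[Sequence[float]], k: int) -> int:
    """Number of points that no other point k-dominates."""
    return sum(
        1
        for i, q in enumerate(points)
        if not any(k_dominates(p, q, k) for j, p in enumerate(points) if j != i)
    )


def mean_skyline_size(n: int, d: int, k: int, trials: int, seed: int = 0) -> float:
    """Average k-dominant skyline size over independent uniform samples in [0,1]^d."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        sample = [tuple(rng.random() for _ in range(d)) for _ in range(n)]
        total += skyline_size(sample, k)
    return total / trials


if __name__ == "__main__":
    for n in (50, 200):
        print(n, [mean_skyline_size(n, 3, k, trials=20) for k in (1, 2, 3)])
```

In such runs one would expect the averages for $k < d$ to shrink as $n$ grows while the average for $k = d$ increases, in line with (1); the monotonicity in $k$ holds for every individual sample.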



3 "Clouds" of $k$-dominant skylines

The asymptotic vanishing property (Theorem 1) for the expected number of $k$-dominant
skylines limits their usefulness if the input data are known to satisfy similar randomness
conditions. In particular, if one is interested in finding the top-$K$ representative points,
then the probability of getting a sufficient number of candidates tends to zero. A simple
remedy to this situation (still following the same notion of partial dominance between
points) is to consider the number of points that are $k$-dominated by a specified number, say
$j$, of other points, which we refer to as the "cloud" of $k$-dominant skylines. But we show
that this also suffers from a similar vanishing drawback under the random hypercube model,
unless $j$ is chosen to be large enough.

Let $L_{d,k}(n,j)$ denote the number of points in the random sample
$\{\mathbf{p}_1, \dots, \mathbf{p}_n\}$ that are $k$-dominated by exactly $j$ points, where
the $n$ points are uniformly and independently selected from $[0,1]^d$. Note that
$L_{d,k}(n,0)$ is nothing but $M_{d,k}(n)$.

Theorem 2 (Asymptotic zero-infinity property for clouds of $k$-dominant skylines). For fixed
$d \ge 2$ and $1 \le k \le d-1$,

\[
  \mathbb{E}[L_{d,k}(n,j)] \to 0,
\]

uniformly for $0 \le j = o\bigl(n^{(1-\varepsilon)/d}\bigr)$, as $n \to \infty$, where
$\varepsilon > 0$ is an arbitrarily small constant.

The theorem roughly says that even allowing a more flexible partial dominance relation, the
expected number of the skylines so constructed still approaches zero as long as the
dimensionality is fixed.

Proof. The case $k = d$ was also derived in [1] (under the name of "$(j+1)$-st layer,
first-quadrant-admissible points"), where it is shown that

\[
  \mathbb{E}[L_{d,d}(n,j)] = \sum_{j < i_1 \le \cdots \le i_{d-1} \le n} \frac{1}{i_1 \cdots i_{d-1}},
\]

from which we obtain

\[
  \mathbb{E}[L_{d,d}(n,j)] \sim \frac{1}{(d-1)!} \Bigl(\log \frac{n}{j+1}\Bigr)^{d-1}
  \tag{6}
\]

if $\log(n/(j+1)) \to \infty$, where the symbol "$\sim$" means that the ratio of both sides
tends to 1 as $n$ goes unbounded. Alternatively, we can use the integral representation
(see [4])

\[
  \mathbb{E}[L_{d,d}(n,j)] = n \binom{n-1}{j} \int_{[0,1]^d}
  (x_1 \cdots x_d)^j (1 - x_1 \cdots x_d)^{n-1-j} \,\mathrm{d}\mathbf{x}
\]

by the change of variables $t \mapsto x_1 \cdots x_d$. A straightforward evaluation then
gives (6).

Note that $\frac{1}{n}\mathbb{E}[L_{d,d}(n,j)]$ equals the probability that the
first-quadrant subtree of the root has size $j$ in random quadtrees; see [16, Appendix]. This
connection also provides several other expressions for $\mathbb{E}[L_{d,d}(n,j)]$. For
example,

\[
  \mathbb{E}[L_{d,d}(n,j)] = n \binom{n-1}{j} \sum_{0 \le \ell \le n-1-j}
  \binom{n-1-j}{\ell} \frac{(-1)^\ell}{(j+\ell+1)^d} \qquad (0 \le j < n);
  \tag{7}
\]

see also [5].

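For the remaining cases, we consider only $k = d-1$ and prove that
$\mathbb{E}[L_{d,d-1}(n,j)] \to 0$. The reason is that

\[
  \sum_{0 \le \ell \le j} L_{d,k}(n,\ell) \le \sum_{0 \le \ell \le j} L_{d,d-1}(n,\ell)
  \qquad (1 \le k \le d-1).
\]

To see this, observe that if a point $\mathbf{p}$ $(d-1)$-dominates another point
$\mathbf{q}$, then $\mathbf{p}$ also $k$-dominates $\mathbf{q}$ for $1 \le k \le d-2$. Thus,
the sum on the left-hand side, which counts the points that are $k$-dominated by at most $j$
points, is at most the sum on the right-hand side, which counts the points that are
$(d-1)$-dominated by at most $j$ points.

To prove $\mathbb{E}[L_{d,d-1}(n,j)] \to 0$, we apply the same argument used in the proof of
Theorem 1, starting from the integral representation

\[
  \mathbb{E}[L_{d,d-1}(n,j)]
  = n\,\mathbb{P}\bigl(\text{exactly $j$ points in } \{\mathbf{p}_2, \dots, \mathbf{p}_n\}
    \text{ $(d-1)$-dominate } \mathbf{p}_1\bigr)
  = n \binom{n-1}{j} \int_{[0,1]^d} |B_{d-1}(\mathbf{x})|^j
    \bigl(1 - |B_{d-1}(\mathbf{x})|\bigr)^{n-1-j} \,\mathrm{d}\mathbf{x}.
\]

Now we fix a constant $t$ satisfying $1 < t < \frac{d}{d-1}$ and then choose $Q_n$ as in (3).
Then we have

\[
  |Q_n| = O\bigl(n^{-t} (\log n)^{d-2}\bigr),
\]

and

\[
  n^{-\frac{d-1}{d}t} \le |B_{d-1}(\mathbf{x})| \le 1 \qquad (\mathbf{x} \in [0,1]^d \setminus Q_n).
\]

It follows that

\[
  \mathbb{E}[L_{d,d-1}(n,j)] \le n|Q_n| + n \binom{n-1}{j} \int_{[0,1]^d \setminus Q_n}
  |B_{d-1}(\mathbf{x})|^j \bigl(1 - |B_{d-1}(\mathbf{x})|\bigr)^{n-1-j} \,\mathrm{d}\mathbf{x}.
\]

Now choose

\[
  t = \frac{d}{d-1}\Bigl(1 - \frac{\log\bigl(\frac{d}{d-1}(j+1)\log n\bigr)}{\log n}\Bigr),
\]

so that

\[
  n \binom{n-1}{j} \exp\bigl(-(n-1-j)\, n^{-\frac{d-1}{d}t}\bigr)
  = O\bigl(n^{j+1} n^{-\frac{d}{d-1}(j+1)}\bigr) = O\bigl(n^{-\frac{1}{d-1}}\bigr)
\]

and

\[
  n^{1-t} (\log n)^{d-2}
  = O\Bigl(n^{-\frac{1}{d-1}} \bigl((j+1)\log n\bigr)^{\frac{d}{d-1}} (\log n)^{d-2}\Bigr)
  = O\Bigl(n^{-\frac{\varepsilon}{d-1}} (\log n)^{d-2+\frac{d}{d-1}}\Bigr),
\]

uniformly for $j = O\bigl(n^{(1-\varepsilon)/d}\bigr)$. Thus

\[
  \mathbb{E}[L_{d,d-1}(n,j)]
  = O\Bigl(n^{-\frac{\varepsilon}{d-1}} (\log n)^{d-2+\frac{d}{d-1}} + n^{-\frac{1}{d-1}}\Bigr) \to 0.
\]

This proves the theorem. ■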

A more precise asymptotic estimate for $\mathbb{E}[L_{d,d-1}(n,j)]$ will be derived in
Section 6; see (21). Another easy special case is $k = 1$, which is dual to the case $k = d$
because we have

\[
  \mathbb{E}[L_{d,1}(n,j)] = \mathbb{E}[L_{d,d}(n, n-1-j)].
\]

Thus, by (7), we have

\[
  \mathbb{E}[L_{d,1}(n,j)] \sim n^{-(d-1)} \binom{j+d-1}{j}
\]

for large $n$ and $0 \le j = o(\sqrt{n})$.

In general, if we are to select the top $K$ representatives using such clusters of partial
dominant skylines, then how large should $j$ be? That is, what is the minimum $m$ such that
$\sum_{0 \le j \le m} L_{d,k}(n,j) \ge K$? Some simulation results are given in Figure 1.
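
For a concrete sample, the minimum $m$ in the question above is simply the $K$-th smallest number of $k$-dominators among the points. A minimal sketch, with illustrative function names of our own and a brute-force quadratic scan, could look as follows.

```python
from typing import List, Sequence


def k_dominates(p: Sequence[float], q: Sequence[float], k: int) -> bool:
    """p k-dominates q: p_j <= q_j in at least k coordinates, one of them strict."""
    not_greater = sum(1 for a, b in zip(p, q) if a <= b)
    return not_greater >= k and any(a < b for a, b in zip(p, q))


def dominator_counts(points: List[Sequence[float]], k: int) -> List[int]:
    """For each point, the number of other points that k-dominate it."""
    return [
        sum(1 for j, p in enumerate(points) if j != i and k_dominates(p, q, k))
        for i, q in enumerate(points)
    ]


def cloud_sizes(points: List[Sequence[float]], k: int) -> List[int]:
    """cloud[j] = number of points k-dominated by exactly j other points."""
    cloud = [0] * len(points)
    for count in dominator_counts(points, k):
        cloud[count] += 1
    return cloud


def minimal_m(points: List[Sequence[float]], k: int, top: int) -> int:
    """Smallest m such that the clouds 0, 1, ..., m together hold at least `top` points."""
    running = 0
    for m, size in enumerate(cloud_sizes(points, k)):
        running += size
        if running >= top:
            return m
    raise ValueError("top exceeds the number of points")


# Illustrative data: five cyclic shifts of (1, 2, 2, 3, 3).
cyclic = [(1, 2, 2, 3, 3), (3, 1, 2, 2, 3), (3, 3, 1, 2, 2),
          (2, 3, 3, 1, 2), (2, 2, 3, 3, 1)]
```

For the five cyclic points each point has exactly one 4-dominator, so the 4-dominant skyline is empty but the first cloud ($j = 1$) already contains all five points.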



4 Random samples from simplices 

We show in this section that the asymptotic vanishing property of $k$-dominant skylines
occurs not only in the case of the $d$-dimensional hypercube distribution, but also in the
$d$-dimensional simplex distribution

\[
  S_d := \Bigl\{ \mathbf{x} : -1 \le x_j \le 0 \text{ and }
  \|\mathbf{x}\| := \sum_{1 \le j \le d} |x_j| \le 1 \Bigr\}.
\]

In particular, $S_2$ is a right triangle. Such a shape implies a negative dependence of the
two coordinates and thus a larger number of skyline points.

Let $M^{[s]}_{d,k}(n)$ denote the cardinality of the $k$-dominant skyline of the set
$\mathcal{D} := \{\mathbf{p}_1, \dots, \mathbf{p}_n\}$, where these $n$ points are uniformly
and independently distributed over $S_d$. For a point $\mathbf{p} \in S_d$, denote by
$B^{[s]}_k(\mathbf{p})$ the region of points in $S_d$ that $k$-dominate $\mathbf{p}$.

Figure 1: Simulated values of $\sum_{0 \le j \le m} L_{d,k}(n,j)$ for $n = 100$ (left) and
$n = 5000$ (right). Interestingly, the simulations suggest some general pattern that seems
independent of the size of the samples, and they are consistent with our analysis since $m$
has to be very large (compared with $n$).



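Theorem 3 (Asymptotic vanishing property for finite-dimensional simplex). For fixed
$d \ge 2$,

\[
  \mathbb{E}[M^{[s]}_{d,k}(n)] \to
  \begin{cases} 0, & \text{if } 1 \le k \le d-1; \\ \infty, & \text{if } k = d, \end{cases}
\]

as $n \to \infty$.

Proof. For $k = d$, it is known (see [12]) that

\[
  \mathbb{E}[M^{[s]}_{d,d}(n)]
  = d!\, n \int_{S_d} \Bigl(1 - \Bigl(1 - \sum_{1 \le j \le d} |x_j|\Bigr)^d\Bigr)^{n-1} \mathrm{d}\mathbf{x}
  = n \sum_{0 \le j < d} \binom{d-1}{j} (-1)^j
    \frac{\Gamma(n)\,\Gamma\bigl(\frac{j+1}{d}\bigr)}{\Gamma\bigl(n + \frac{j+1}{d}\bigr)}
  = \Gamma\Bigl(\frac{1}{d}\Bigr) n^{1-\frac{1}{d}} \Bigl(1 + O\bigl(d\, n^{-\frac{1}{d}}\bigr)\Bigr),
\]

where $\Gamma$ denotes the Gamma function. Thus the expected number of skylines tends to
infinity as $n$ goes unbounded.

Consider now $1 \le k < d$. It suffices to examine the case $k = d-1$. For a point
$\mathbf{x} \in S_d$ ($\mathbf{x} \ne \mathbf{0}$), let
$\boldsymbol{\xi} := \mathbf{x}/\|\mathbf{x}\|$. Then
$B^{[s]}_{d-1}(\boldsymbol{\xi}) \subset B^{[s]}_{d-1}(\mathbf{x})$. We now prove that

\[
  |B^{[s]}_{d-1}(\boldsymbol{\xi})| \ge \frac{1}{d^d\, d!}.
  \tag{8}
\]

Since $\|\boldsymbol{\xi}\| = 1$, there is at least one coordinate with
$|\xi_j| \ge \frac{1}{d}$. Without loss of generality, assume $|\xi_d| \ge \frac{1}{d}$. Then
$\sum_{1 \le j < d} |\xi_j| \le 1 - \frac{1}{d}$. Let

\[
  T := \{\mathbf{y} \in S_d : y_j \le \xi_j \text{ for } 1 \le j \le d-1 \text{ and } y_d \le 0\}.
\]

We have $T \subset B^{[s]}_{d-1}(\boldsymbol{\xi})$ and

\[
  |T| = |S_d|\, |\xi_d|^d \ge \frac{1}{d^d\, d!},
\]

since $T$ is itself a simplex. Thus (8) holds and we have

\[
  \mathbb{E}[M^{[s]}_{d,d-1}(n)]
  = d!\, n \int_{S_d} \bigl(1 - d!\, |B^{[s]}_{d-1}(\mathbf{x})|\bigr)^{n-1} \mathrm{d}\mathbf{x}
  = O\bigl(n (1 - d^{-d})^n\bigr) \to 0,
\]

as $n \to \infty$. ■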

We see in such a simplex model that the expected number of $k$-dominant skyline points tends
to zero at an exponential rate (in $n$), in contrast to the polynomial rate in the hypercube
model. Does the expected number of $k$-dominant skyline points always tend to zero? Here is a
simple, artificial counterexample.

Example 1. Assume $d = 4$, $k = 3$. Let

\[
  A := \{(-t, -2t, 3t, 4t) : 1 \le t \le 2\}.
\]

Then any two points in $A$ are incomparable (none dominating the other) by the relation of
$k$-dominance. Thus, the number of $k$-dominant skyline points is equal to $n$ almost surely
if $\mathbf{p}_1, \dots, \mathbf{p}_n$ are uniformly and independently distributed in $A$.
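
The incomparability claimed in Example 1 can be checked mechanically on a grid of parameter values. The sketch below is illustrative: it reads the curve as $(-t, -2t, 3t, 4t)$, so that two coordinates decrease and two increase with $t$, and the helper names are our own.

```python
from typing import Sequence, Tuple


def k_dominates(p: Sequence[float], q: Sequence[float], k: int) -> bool:
    """p k-dominates q: p_j <= q_j in at least k coordinates, one of them strict."""
    not_greater = sum(1 for a, b in zip(p, q) if a <= b)
    return not_greater >= k and any(a < b for a, b in zip(p, q))


def curve_point(t: float) -> Tuple[float, float, float, float]:
    """Point of the curve A for parameter t (assumed form of Example 1)."""
    return (-t, -2.0 * t, 3.0 * t, 4.0 * t)


# Eleven points on the curve for t = 1.0, 1.1, ..., 2.0.
samples = [curve_point(1.0 + i / 10.0) for i in range(11)]

# No point 3-dominates another: each ordered pair agrees in exactly two coordinates.
all_incomparable = all(
    not k_dominates(p, q, 3)
    for i, p in enumerate(samples)
    for j, q in enumerate(samples)
    if i != j
)
```

Any two distinct points of the curve are better than each other in exactly two of the four coordinates, so 3-dominance never holds, whereas 2-dominance holds in both directions.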



5 A categorical model 

The preceding negative results are based on assuming that the points are generated from some
continuous models, which are often a good approximation to situations where the input can
assume a sufficiently large range of different values. What if we assume instead that the
inputs are sampled from some discrete space, which is also often encountered in practical
applications? We show in this section that the expected number of $k$-dominant skylines is
always linear for $1 \le k \le d$, in contrast to the asymptotic zero-infinity property we
derived above.

Assume that $n$ points $\mathcal{D} := \{\mathbf{p}_1, \dots, \mathbf{p}_n\}$ are chosen
uniformly and independently from the product space

\[
  \mathcal{Z} := \bigotimes_{1 \le i \le d} \mathcal{Z}_i,
\]

where

\[
  \mathcal{Z}_i = \{1, 2, \dots, u_i\} \qquad (u_i \ge 2).
\]

Let $M^{[c]}_{d,k}(n)$ denote the number of $k$-dominant skylines in $\mathcal{D}$. Unlike
the continuous cases, the variation of the random variables $M^{[c]}_{d,k}(n)$ is easier to
predict as the number of possible points in $\mathcal{Z}$ is finite. Interestingly, the
first-order asymptotic estimate for the expected value of $M^{[c]}_{d,k}(n)$ is independent
of $k$ for $1 \le k \le d$, where the case $k = d$ gives the expected skyline count.

Theorem 4 (Asymptotic linearity for finite-dimensional categorical model). The expected
number of $k$-dominant skylines satisfies

\[
  \frac{\mathbb{E}[M^{[c]}_{d,k}(n)]}{n} \to \frac{1}{u} \qquad (1 \le k \le d;\ d \ge 2),
  \tag{9}
\]

as $n \to \infty$, where

\[
  u := \prod_{1 \le i \le d} u_i.
\]

Now the problem is again the excessive number of skyline points. Such a discrete model
exhibits another interesting phenomenon, not present in continuous models, namely, for fixed
$n$, the expected number of $k$-dominant skyline points is not monotonically increasing as
$d$ grows.
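Proof. Let $\mathbf{x} = (x_1, x_2, \dots, x_d) \in \mathcal{Z}$. Denote by
$B^{[c]}_k(\mathbf{x})$ the set of points in $\mathcal{Z}$ that $k$-dominate $\mathbf{x}$.
Then

\[
  \mathbb{E}[M^{[c]}_{d,k}(n)]
  = n\,\mathbb{P}(\mathbf{p}_1 \text{ is a $k$-dominant skyline point})
  = \frac{n}{u} \sum_{\mathbf{x} \in \mathcal{Z}}
    \Bigl(1 - \frac{|B^{[c]}_k(\mathbf{x})|}{u}\Bigr)^{n-1}.
  \tag{10}
\]

If $\mathbf{y} \in B^{[c]}_k(\mathbf{x})$, then $\mathbf{y}$ is better than or equal to
$\mathbf{x}$ in all coordinates (at least one better) except for some coordinates, say
$j_1, \dots, j_\ell$, where $0 \le \ell \le d-k$. Thus

\[
  |B^{[c]}_d(\mathbf{x})| = \prod_{1 \le j \le d} x_j - 1,
\]

and for $1 \le k < d$

\[
  |B^{[c]}_k(\mathbf{x})| = \sum_{0 \le \ell \le d-k}\ \sum_{1 \le j_1 < j_2 < \cdots < j_\ell \le d}
  \Bigl(\prod_{\substack{1 \le i \le d \\ i \ne j_r,\ r = 1, \dots, \ell}} x_i - 1\Bigr)
  \prod_{1 \le r \le \ell} (u_{j_r} - x_{j_r}).
  \tag{11}
\]

Here the product

\[
  \prod_{\substack{1 \le i \le d \\ i \ne j_r,\ r = 1, \dots, \ell}} x_i
\]

enumerates all possible locations in the $d - \ell$ ($\ge k$) coordinates that a
$k$-dominating point can assume, and the factor "$-1$" removes the possibility that all
$d - \ell$ coordinates are equal to the corresponding $x_i$. The last product in (11)
describes all possible locations for the other $\ell$ coordinates.

Since there is a unique point $\mathbf{1} := (1, \dots, 1)$ in $\mathcal{Z}$ with
$|B^{[c]}_k(\mathbf{1})| = 0$, all other terms in the sum on the right-hand side of (10)
being exponentially small, we obtain (9). ■

In the special case when all $u_j = 2$ for $1 \le j \le d$, we have

\[
  |B^{[c]}_k(\mathbf{x})| = (2^i - 1) \sum_{0 \le j \le d-k} \binom{d-i}{j},
\]

where $\mathbf{x} \in \{1, 2\}^d$ and $i$ denotes the number of times "2" occurs in
$\mathbf{x}$ (and "1" occurring $d-i$ times). The closed-form expression (10) simplifies to

\[
  \mathbb{E}[M^{[c]}_{d,k}(n)] = \frac{n}{2^d} \sum_{0 \le i \le d} \binom{d}{i}
  \Bigl(1 - \frac{2^i - 1}{2^d} \sum_{0 \le j \le d-k} \binom{d-i}{j}\Bigr)^{n-1},
\]

from which it follows that

\[
  \frac{\mathbb{E}[M^{[c]}_{d,k}(n)]}{n} \to \frac{1}{2^d} \qquad \text{as } n \to \infty.
\]

Figure 2: A graphical rendering of $\mathbb{E}[M^{[c]}_{d,k}(n)]$ in the discrete space
$\{0,1\}^d$ for $d = 10$, $k = 9$ and $n = 1, \dots, 25$ (left) and $n = 25, \dots, 1000$
(right).

Figure 3: Two plots of the ratio $\mathbb{E}[M^{[c]}_{d,k}(n)]/n$ when $d = 5$, $k = 3, 4, 5$
(here the case $k = 5$ corresponds to the skyline), $u_i = 2$ (left) and $u_i = 5$ (right).
All curves in the left figure tend to the limit $2^{-5} = 0.03125$ while those in the right
tend to $5^{-5} = 0.00032$, which is almost zero.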
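
The finite sum (10) can be evaluated exactly for small product spaces, which gives a direct numerical check of the limit in (9) and of the binary closed form. The following sketch enumerates the whole space by brute force; the function names are illustrative, and the enumeration is only practical when $u$ is small.

```python
from itertools import product
from math import comb
from typing import Sequence


def k_dominates(p: Sequence[int], q: Sequence[int], k: int) -> bool:
    """p k-dominates q: p_j <= q_j in at least k coordinates, one of them strict."""
    not_greater = sum(1 for a, b in zip(p, q) if a <= b)
    return not_greater >= k and any(a < b for a, b in zip(p, q))


def expected_categorical(n: int, sizes: Sequence[int], k: int) -> float:
    """Right-hand side of (10), with |B_k(x)| counted by enumeration."""
    space = list(product(*[range(1, u + 1) for u in sizes]))
    u = len(space)
    total = 0.0
    for x in space:
        dominating = sum(1 for y in space if k_dominates(y, x, k))
        total += (1.0 - dominating / u) ** (n - 1)
    return n * total / u


def expected_binary(n: int, d: int, k: int) -> float:
    """Closed form for the binary case u_1 = ... = u_d = 2."""
    total = 0.0
    for i in range(d + 1):
        dominating = (2 ** i - 1) * sum(comb(d - i, j) for j in range(d - k + 1))
        total += comb(d, i) * (1.0 - dominating / 2 ** d) ** (n - 1)
    return n * total / 2 ** d
```

For instance, `expected_categorical(2000, (2, 2, 2), 2) / 2000` is numerically indistinguishable from $1/8$, whatever the value of $k$.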

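Since the product space $\mathcal{Z}$ is finite, we can indeed fully characterize the
asymptotic distribution of $M^{[c]}_{d,k}(n)$.

Theorem 5 (Asymptotic binomial distribution for finite-dimensional categorical model). The
distribution of $M^{[c]}_{d,k}(n)$ is asymptotically equivalent to a binomial distribution
with parameters $n$ and $1/u$.

Proof. Let $X_n$ denote the number of $j$'s for which $\mathbf{p}_j = (1, \dots, 1)$,
$1 \le j \le n$. Then, obviously, $X_n$ is binomially distributed with parameters $n$ and
$1/u$, namely,

\[
  \mathbb{P}(X_n = r) = \binom{n}{r} \frac{1}{u^r} \Bigl(1 - \frac{1}{u}\Bigr)^{n-r}
  \qquad (0 \le r \le n).
\]

Now if one of the points $\mathbf{p}_j$ equals $(1, \dots, 1)$, then
$M^{[c]}_{d,k}(n) = X_n$. Thus

\[
  \mathbb{P}\bigl(M^{[c]}_{d,k}(n) \ne X_n\bigr) \le \Bigl(1 - \frac{1}{u}\Bigr)^n,
\]

and thus the distribution of $M^{[c]}_{d,k}(n)$ is asymptotic to the distribution of
$X_n$. ■

In particular, we see that the variance of $M^{[c]}_{d,k}(n)$ is also asymptotically linear:

\[
  \frac{\mathbb{V}[M^{[c]}_{d,k}(n)]}{n} \to \frac{1}{u}\Bigl(1 - \frac{1}{u}\Bigr)
  \qquad (1 \le k \le d).
\]

The consideration can be easily extended to the case of non-uniform discrete distributions.
More generally, assume that the data set is sampled from the set
$\{\mathbf{a}_1, \dots, \mathbf{a}_m\} \subset \mathbb{R}^d$ and each point $\mathbf{a}_j$ is
endowed with the probability $\mathbb{P}(\mathbf{a}_j)$. Let $P_k(\mathbf{a}_j)$ be the
probability that $\mathbf{a}_j$ is $k$-dominated, that is, $P_k(\mathbf{a}_j)$ is equal to
the sum of $\mathbb{P}(\mathbf{a}_i)$ over all $\mathbf{a}_i$ that $k$-dominate
$\mathbf{a}_j$. Then the expected number of $k$-dominant skyline points satisfies

\[
  \mathbb{E}[M^{[c]}_{d,k}(n)] = n \sum_{1 \le j \le m} \mathbb{P}(\mathbf{a}_j)
  \bigl(1 - P_k(\mathbf{a}_j)\bigr)^{n-1}.
\]

Let

\[
  q_k := \sum_{\substack{1 \le j \le m \\ P_k(\mathbf{a}_j) = 0}} \mathbb{P}(\mathbf{a}_j)
\]

be the probability of points in $\{\mathbf{a}_1, \dots, \mathbf{a}_m\}$ that are not
$k$-dominated. Then since the expected number of $k$-dominant skyline points is expressed as
a finite sum, we have

\[
  \frac{\mathbb{E}[M^{[c]}_{d,k}(n)]}{n} \to q_k, \qquad \text{as } n \to \infty.
\]

Note that $q_k$ may range from zero to one.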
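
The non-uniform formula and its limit are easy to evaluate for a small set of atoms. The sketch below is illustrative only: the atoms, their probabilities, and the function names are assumptions of ours, and the atoms are assumed to be distinct with positive probabilities.

```python
from typing import List, Sequence


def k_dominates(p: Sequence[float], q: Sequence[float], k: int) -> bool:
    """p k-dominates q: p_j <= q_j in at least k coordinates, one of them strict."""
    not_greater = sum(1 for a, b in zip(p, q) if a <= b)
    return not_greater >= k and any(a < b for a, b in zip(p, q))


def dominated_mass(atoms: List[Sequence[float]], probs: List[float], k: int) -> List[float]:
    """P_k(a_j): total probability of the atoms that k-dominate a_j."""
    return [
        sum(pi for ai, pi in zip(atoms, probs) if k_dominates(ai, aj, k))
        for aj in atoms
    ]


def expected_count(n: int, atoms: List[Sequence[float]], probs: List[float], k: int) -> float:
    """n * sum_j P(a_j) * (1 - P_k(a_j))^(n-1)."""
    masses = dominated_mass(atoms, probs, k)
    return n * sum(p * (1.0 - m) ** (n - 1) for p, m in zip(probs, masses))


def limit_ratio(atoms: List[Sequence[float]], probs: List[float], k: int) -> float:
    """q_k: total probability of the atoms that no other atom k-dominates."""
    masses = dominated_mass(atoms, probs, k)
    return sum(p for p, m in zip(probs, masses) if m == 0)


# Hypothetical three-atom distribution in dimension 2.
atoms = [(1, 2), (2, 1), (2, 2)]
probs = [0.2, 0.3, 0.5]
```

Here $q_2 = 0.5$, since only $(2,2)$ is dominated, while $q_1 = 0$, since $(1,2)$ and $(2,1)$ 1-dominate each other; this illustrates that $q_k$ can take either extreme.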
6 Uniform asymptotic estimates for $\mathbb{E}[M_{d,d-1}(n)]$

We derive in this section two uniform asymptotic estimates for $\mathbb{E}[M_{d,d-1}(n)]$ in
two overlapping ranges. To state our results, we need to introduce the Lambert $W$-function
(see [13]), which is implicitly defined by the equation

\[
  W(z) e^{W(z)} = z.
  \tag{12}
\]

For our purpose, we take $W$ to be the principal branch that is positive for positive $z$ and
satisfies the asymptotic approximation

\[
  W(x) = \log x - \log\log x + \frac{\log\log x}{\log x}
  + O\Bigl(\frac{(\log\log x)^2}{(\log x)^2}\Bigr),
  \tag{13}
\]

for large $x$.
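
Since $W$ enters all the ranges below, it may help to have a way of evaluating it. The following is a minimal sketch using Newton's iteration on $w e^w = x$ for $x > 0$; the starting value, tolerance, and function names are our own choices, and a library routine could of course be used instead.

```python
from math import exp, log


def lambert_w(x: float, tol: float = 1e-12) -> float:
    """Principal branch W(x) for x > 0, by Newton's iteration on w * e^w = x."""
    if x <= 0.0:
        raise ValueError("this sketch only handles x > 0")
    w = log(1.0 + x)  # starting guess lying to the right of the root
    for _ in range(100):
        ew = exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) <= tol * (1.0 + abs(w)):
            break
    return w


def w_asymptotic(x: float) -> float:
    """The first three terms of (13); intended for large x only."""
    l1 = log(x)
    l2 = log(l1)
    return l1 - l2 + l2 / l1


if __name__ == "__main__":
    for x in (1e2, 1e4, 1e6):
        print(x, lambert_w(x), w_asymptotic(x))
```

The three-term approximation is already close for moderately large arguments, with an error of the order indicated by the $O$-term in (13).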

Our first asymptotic estimate covers $d$ in the range

\[
  3 \le d \le \sqrt{\frac{2\log n}{W(2\log n) + K}},
\]

where $K \to \infty$ with $n$, and the second the range

\[
  (\log n)^{1/3} \ll d \le 2\sqrt{\frac{\log n}{W(\log n) - C}},
\]

for some constant $C > 0$. The upper bounds of the two ranges do not differ significantly but
are sufficient for our purposes of proving the threshold phenomenon, which we discuss in the
next section.

Very roughly, the expected number of $(d-1)$-dominant skylines is asymptotically negligible
in the first range, and undergoes the phase transition from being almost zero to unbounded in
the second.

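Theorem 6 (Uniform estimate for large $n$ and moderate $d$). If $d \ge 3$ and

\[
  \frac{2\log n}{d^2} - W(2\log n) \to \infty,
  \tag{14}
\]

then

\[
  \mathbb{E}[M_{d,d-1}(n)] = \frac{n^{-\frac{1}{d-1}}}{d-1}\, \Gamma\Bigl(\frac{1}{d-1}\Bigr)^d
  \Bigl(1 + O\bigl(d\, n^{-\frac{1}{(d-1)(d-2)}}\bigr)\Bigr),
  \tag{15}
\]

uniformly in $d$ for large $n$.

Note that if $d$ is of the form

\[
  d = \sqrt{\frac{2\log n}{W(2\log n) + 2v}},
\]

then

\[
  d\, n^{-\frac{1}{(d-1)(d-2)}} = e^{-v} \Bigl(1 + O\Bigl(
  \frac{(1 + |v|)\, W(2\log n)^{3/2}}{\sqrt{\log n}}\Bigr)\Bigr),
\]

which becomes $o(1)$ if $v \to \infty$.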

On the other hand, when $d = 2$, we have, by (2),

\[
  \mathbb{E}[M_{2,1}(n)] = n \int_0^1\!\!\int_0^1 (1 - x - y + xy)^{n-1} \,\mathrm{d}x\,\mathrm{d}y
  = \frac{1}{n}.
\]
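
The closed form (4) for $|B_{d-1}(\mathbf{x})|$ and the value $1/n$ in the case $d = 2$ are easy to check numerically. The sketch below is illustrative: the names `volume_b` and `expected_d2` and the midpoint-rule grid size are our own choices, and the quadrature is only an approximation.

```python
from math import prod
from typing import Sequence


def volume_b(x: Sequence[float]) -> float:
    """|B_{d-1}(x)| from (4): sum over l of prod_{j != l} x_j, minus (d-1) prod_j x_j."""
    d = len(x)
    leave_one_out = sum(prod(x[j] for j in range(d) if j != l) for l in range(d))
    return leave_one_out - (d - 1) * prod(x)


def expected_d2(n: int, grid: int = 200) -> float:
    """Midpoint-rule approximation of n times the integral over the unit square
    of (1 - |B_1(x, y)|)^(n-1)."""
    h = 1.0 / grid
    total = 0.0
    for i in range(grid):
        for j in range(grid):
            point = ((i + 0.5) * h, (j + 0.5) * h)
            total += (1.0 - volume_b(point)) ** (n - 1)
    return n * total * h * h


if __name__ == "__main__":
    for n in (2, 5, 10):
        print(n, expected_d2(n), 1.0 / n)
```

For $d = 2$ the integrand factors as $(1-x)^{n-1}(1-y)^{n-1}$, which is why the integral evaluates to $n \cdot n^{-2} = 1/n$.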

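Proof. We again begin with the integral representation (2), where $|B_{d-1}(\mathbf{x})|$ is
given in (4). By the elementary inequalities (see [6])

\[
  e^{-nt}(1 - nt^2) \le (1-t)^n \le e^{-nt} \qquad (n \ge 1;\ t \in [0,1]),
\]

we have

\[
  E_{n,d} - E'_{n,d} \le \frac{n}{n+1}\, \mathbb{E}[M_{d,d-1}(n+1)] \le E_{n,d},
\]

where

\[
  E_{n,d} := n \int_{[0,1]^d} e^{-n|B_{d-1}(\mathbf{x})|} \,\mathrm{d}\mathbf{x}, \qquad
  E'_{n,d} := n^2 \int_{[0,1]^d} |B_{d-1}(\mathbf{x})|^2 e^{-n|B_{d-1}(\mathbf{x})|} \,\mathrm{d}\mathbf{x}.
\]

We will see that $E'_{n,d}$ is asymptotically of smaller order than $E_{n,d}$. The intuition
here is that most of the contribution to the integral comes from $\mathbf{x}$ for which
$|B_{d-1}(\mathbf{x})|$ is small, implying that $(1 - |B_{d-1}(\mathbf{x})|)^n$ is close to
$e^{-n|B_{d-1}(\mathbf{x})|}$. Also, replacing $n+1$ by $n$ in the resulting asymptotic
approximation gives rise only to smaller-order errors. However, the uniform error bound
represents the most delicate part of our proof.

We start with the asymptotic evaluation of $E_{n,d}$. By making the change of variables
$x_j \mapsto y_j/N$, where $N := n^{\frac{1}{d-1}}$, we obtain

\[
  E_{n,d} = n^{-\frac{1}{d-1}} \int_{[0,N]^d}
  e^{-y_1 \cdots y_d \left(\frac{1}{y_1} + \cdots + \frac{1}{y_d}\right) + \frac{d-1}{N} y_1 \cdots y_d}
  \,\mathrm{d}\mathbf{y}
  = n^{-\frac{1}{d-1}} \bigl(\phi_d(n) - f_d(n) + R_d(n)\bigr),
  \tag{16}
\]

where

\[
  \phi_d(n) := \int_{[0,\infty)^d}
  e^{-y_1 \cdots y_d \left(\frac{1}{y_1} + \cdots + \frac{1}{y_d}\right)} \,\mathrm{d}\mathbf{y},
\]

\[
  f_d(n) := \int_{[0,\infty)^d \setminus [0,N]^d}
  e^{-y_1 \cdots y_d \left(\frac{1}{y_1} + \cdots + \frac{1}{y_d}\right)} \,\mathrm{d}\mathbf{y},
\]

\[
  R_d(n) := \int_{[0,N]^d}
  e^{-y_1 \cdots y_d \left(\frac{1}{y_1} + \cdots + \frac{1}{y_d}\right)}
  \Bigl(e^{\frac{d-1}{N} y_1 \cdots y_d} - 1\Bigr) \,\mathrm{d}\mathbf{y}.
\]

We focus on the evaluation of the integral $\phi_d(n)$, leaving the lengthier estimation of
the two error terms $f_d(n)$ and $R_d(n)$ to Appendix A.

We now carry out the change of variables $t_j := \prod_{\ell \ne j} y_\ell$ for
$1 \le j \le d$, the Jacobian being

\[
  \frac{\partial(y_1, \dots, y_d)}{\partial(t_1, \dots, t_d)} =
  \begin{pmatrix}
    \frac{\partial y_1}{\partial t_1} & \cdots & \frac{\partial y_1}{\partial t_d} \\
    \vdots & \ddots & \vdots \\
    \frac{\partial y_d}{\partial t_1} & \cdots & \frac{\partial y_d}{\partial t_d}
  \end{pmatrix},
\]

whose determinant is equal to $1/\det J$, where

\[
  J := \frac{\partial(t_1, \dots, t_d)}{\partial(y_1, \dots, y_d)}.
\]

Note that the entries of $J$ satisfy

\[
  J_{ij} = \begin{cases} 0, & \text{if } i = j; \\[4pt]
  \dfrac{y_1 \cdots y_d}{y_i y_j}, & \text{if } i \ne j. \end{cases}
\]

It follows that

\[
  \det J = (y_1 \cdots y_d)^{d-2} \det T,
\]

where $T$ is a $d \times d$ matrix with $T_{jj} = 0$ and $T_{ij} = 1$ for $i \ne j$. The
determinant of $T$ is seen to be $(-1)^{d-1}(d-1)$ by adding all rows of $T$ to the first, by
taking the factor $d-1$ out, and then by subtracting the first row from all other rows. Thus
we have

\[
  \det J = (-1)^{d-1} (d-1) (y_1 \cdots y_d)^{d-2}.
\]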
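
The value of $\det T$ can be confirmed for small $d$ by exact rational arithmetic. The following sketch uses plain Gaussian elimination over fractions; the helper names are ours and nothing here is specific to the paper beyond the matrix $T$ itself.

```python
from fractions import Fraction
from typing import List


def determinant(matrix: List[List[int]]) -> Fraction:
    """Exact determinant by Gaussian elimination with row swaps."""
    a = [[Fraction(v) for v in row] for row in matrix]
    size = len(a)
    result = Fraction(1)
    for col in range(size):
        # Find a row at or below the diagonal with a nonzero entry in this column.
        pivot = next((r for r in range(col, size) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result  # a row swap flips the sign
        result *= a[col][col]
        for r in range(col + 1, size):
            factor = a[r][col] / a[col][col]
            for c in range(col, size):
                a[r][c] -= factor * a[col][c]
    return result


def matrix_t(d: int) -> List[List[int]]:
    """The d x d matrix with zeros on the diagonal and ones elsewhere."""
    return [[0 if i == j else 1 for j in range(d)] for i in range(d)]


if __name__ == "__main__":
    for d in range(2, 8):
        print(d, determinant(matrix_t(d)), (-1) ** (d - 1) * (d - 1))
```

The same value follows from the eigenvalues of $T$, namely $d-1$ (once) and $-1$ (with multiplicity $d-1$).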
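Thus, by the integral representation of the Gamma function

\[
  \Gamma(x) = \int_0^\infty e^{-u} u^{x-1} \,\mathrm{d}u \qquad (x > 0),
\]

we obtain

\[
  \phi_d(n) = \frac{1}{d-1} \int_{[0,\infty)^d} e^{-(t_1 + \cdots + t_d)}
  (t_1 \cdots t_d)^{-\frac{d-2}{d-1}} \,\mathrm{d}\mathbf{t}
  = \frac{1}{d-1}\, \Gamma\Bigl(\frac{1}{d-1}\Bigr)^d.
\]

We will prove in Appendix A that

\[
  \frac{f_d(n)}{\phi_d(n)} = O\bigl(d\, n^{-\frac{1}{(d-1)(d-2)}}\bigr), \qquad
  \frac{R_d(n)}{\phi_d(n)} = O\bigl(d\, 2^{-d} n^{-\frac{1}{d-1}}\bigr).
  \tag{17}
\]

In a similar manner, we have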

= O £^ (t, + . . . + uf e-(*^+ -+*^)(ti ■ ■ ■ t.)-Hdt 

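The last integral, in a more general form, can be evaluated as follows. Let $[z^j]f(z)$ denote the coefficient of $z^j$ in the Taylor expansion of $f$. Then

\[
\int_{[0,\infty)^d} (t_1+\cdots+t_d)^j\, e^{-(t_1+\cdots+t_d)}(t_1\cdots t_d)^{-\frac{d-2}{d-1}}\,\mathrm{d}\mathbf{t}
= j!\,[z^j]\,\Gamma\!\left(\frac{1}{d-1}\right)^{d}(1-z)^{-\frac{d}{d-1}}
= j!\,\Gamma\!\left(\frac{1}{d-1}\right)^{d}\binom{\frac{1}{d-1}+j}{j}
\]

for $j \ge 0$. Thus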



<\>d{n) 



Collecting these estimates proves the theorem. ■ 

When $d$ increases beyond the range (14), the error term $f_d(n)$ (see (16)) is no longer negligible, and a more delicate analysis is needed.



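Theorem 7 (Uniform asymptotic estimate in the critical range). If

\[
d \to \infty \quad\text{and}\quad d \le \sqrt{\frac{2\log n}{W\!\left(\frac{4\log n}{(e\log 2)^2}\right)}}, \tag{18}
\]

then, with $\rho := \dfrac{d}{e\,n^{1/d^2}}$,

\[
\mathbb{E}[M_{d,d-1}(n)] = \frac{n^{-\frac{1}{d-1}}}{d-1}\,\Gamma\!\left(\frac{1}{d-1}\right)^{d}\frac{1+o(1)}{2-e^{-\rho}}, \tag{19}
\]

uniformly in $d$ for large $n$.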

The proof of this theorem is very long and is thus relegated to Appendix B. The crucial step is to prove an asymptotic estimate for $f_d(n)$ by an inductive argument, deriving first a recurrence of the form

\[
f_d(n) = g_d(n) + \Phi[f_d](n) + \text{smaller order terms},
\]

where $g_d(n)$ is an explicit finite sum (see Appendix B) and $\Phi$ is an operator defined by

\[
\Phi[f_d](n) := \sum_{1\le j\le d-2}\binom{d}{j}(-1)^{j}\,n^{-\frac{j}{(d-1)(d-1-j)}}\int_{(1,\infty)^j}(v_1\cdots v_j)^{-\frac{d-j}{d-1-j}}\,f_{d-j}(n v_1\cdots v_j)\,\mathrm{d}\mathbf{v}.
\]

Then (19) follows from iterating the operator and a careful analysis of the resulting sums.
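Corollary 1. If $d$ is of the form

\[
d = \sqrt{\frac{2\log n}{W(2\log n) - 2v - 2}},
\]

then

\[
\frac{\mathbb{E}[M_{d,d-1}(n)]}{\frac{n^{-\frac{1}{d-1}}}{d-1}\,\Gamma\!\left(\frac{1}{d-1}\right)^{d}} \sim
\begin{cases}
1, & \text{if } v\to-\infty;\\[2pt]
\dfrac{1}{2-e^{-e^{v}}}, & \text{if } v = O(1);\\[4pt]
\dfrac12, & \text{if } v\to\infty.
\end{cases} \tag{20}
\]

Proof. Observe that

\[
\rho = \frac{d}{e\,n^{1/d^2}} = e^{v}\left(1 + O\!\left(\frac{1+|v|}{W(2\log n)}\right)\right).
\]

Thus (20) follows from this and (19). ■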

Combining the ranges (14) and (18) of the two estimates (15) and (19), we see that 

Corollary 2. If 



3<d<2 

then 



logn 



Vr(4e-2 1ogn)' 



d 



uniformly in d. 

We conclude from these estimates that $\mathbb{E}[M_{d,d-1}(n)]$ is, modulo a constant factor, very well approximated by $\frac{n^{-1/(d-1)}}{d-1}\,\Gamma\!\left(\frac{1}{d-1}\right)^{d}$.

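Remark. A similar analysis as that for (15) leads to ($L_{d,k}(n,j)$ is defined in Section 3)

\[
\mathbb{E}[L_{d,d-1}(n,j)] \sim C_{d,j}\,n^{-\frac{1}{d-1}}, \tag{21}
\]

for each finite integer $j \ge 0$, where

\[
C_{d,j} := \frac{1}{(d-1)\,j!}\int_{[0,\infty)^d}(v_1+\cdots+v_d)^j\,e^{-(v_1+\cdots+v_d)}(v_1\cdots v_d)^{-\frac{d-2}{d-1}}\,\mathrm{d}\mathbf{v}
= \frac{1}{d-1}\,\Gamma\!\left(\frac{1}{d-1}\right)^{d}\binom{\frac{1}{d-1}+j}{j},
\]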

uniformly when ^'°f " — 1^(2 logn) — )■ oo and j = o (^n^^, e G (0, 1). The consideration 
for larger d as for (19) is similar. 
7 Threshold phenomenon for $\mathbb{E}[M_{d,d-1}(n)]$ when $d\to\infty$

With the asymptotic estimates (15) and (19) derived in the previous section, we prove in this section a less expected threshold phenomenon for the expected number of $(d-1)$-dominant skylines $\mathbb{E}[M_{d,d-1}(n)]$ (in random samples from the $d$-dimensional hypercube) when $d-1$ is near $\sqrt{\frac{2\log n}{W(2\log n)}}$.

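Theorem 8 (Threshold phenomenon). Let

\[
d_0 = d_0(n) := \left\lfloor\sqrt{\frac{2\log n}{W(2\log n)}}\,\right\rfloor + 1, \tag{22}
\]

where $W$ denotes the Lambert-W function. Then the expected number of $(d-1)$-dominant skyline points satisfies

\[
\lim_{n\to\infty}\mathbb{E}[M_{d,d-1}(n)] =
\begin{cases}
0, & \text{if } d < d_0;\\
\infty, & \text{if } d > d_0+1.
\end{cases} \tag{23}
\]

If $d = d_0$, then $\lim_{n\to\infty}\mathbb{E}[M_{d,d-1}(n)]$ does not exist and the expected value oscillates between $0$ and $\frac{e^{-\gamma}}{2-e^{-1/e}}$:

\[
\mathbb{E}[M_{d,d-1}(n)] \sim \frac{e^{-\gamma}}{2-e^{-1/e}}\;\varphi_0\!\left(\sqrt{\frac{2\log n}{W(2\log n)}}\right), \tag{24}
\]

where $\varphi_0(x)$ is a bounded oscillating function of $x$ defined by $\varphi_0(x) := e^{-\{x\}}x^{-2\{x\}}$, $\{x\}$ denoting the fractional part of $x$.

If $d = d_0+1$, then $\lim_{n\to\infty}\mathbb{E}[M_{d,d-1}(n)]$ does not exist and the expected value oscillates between $\frac{e^{-\gamma}}{2-e^{-1/e}}$ and a quantity of order $\frac{\log n}{\log\log n}$:

\[
\mathbb{E}[M_{d,d-1}(n)] \sim \frac{e^{-\gamma}}{2-e^{-1/e}}\;\varphi_1\!\left(\sqrt{\frac{2\log n}{W(2\log n)}}\right), \tag{25}
\]

where $\varphi_1(x)$ is an oscillating function of $x$ defined by $\varphi_1(x) := e^{1-\{x\}}x^{2(1-\{x\})}$.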

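Proof. By monotonicity, it suffices to examine the asymptotic behavior of $\mathbb{E}[M_{d,d-1}(n)]$ for $d$ near $d_0$. Observe that if

\[
d = d_0 + m = \sqrt{\frac{2\log n}{W(2\log n)}} - r_n + m + 1,
\]

where $m$ is an integer and $r_n$ denotes the fractional part of $\sqrt{\frac{2\log n}{W(2\log n)}}$, then

\[
\rho = \frac{d}{e\,n^{1/d^2}} = e^{-1}\bigl(1+o(1)\bigr),
\]

where, here and throughout the proof, $W_n := W(2\log n)$. Thus for bounded $m$

\[
\frac{1}{2-e^{-\rho}} \sim \frac{1}{2-e^{-1/e}}.
\]

On the other hand, by (19) and the asymptotic estimate $\Gamma(x) = x^{-1} - \gamma + O(x)$ as $x\to0$, where $\gamma$ denotes the Euler constant, we see that

\[
\frac{n^{-\frac{1}{d-1}}}{d-1}\,\Gamma\!\left(\frac{1}{d-1}\right)^{d} \sim e^{-\gamma+m-r_n}\left(\frac{2\log n}{W_n}\right)^{m-r_n}
\begin{cases}
\to 0, & \text{if } m \le -1;\\[2pt]
= e^{-\gamma}\varphi_0\!\left(\sqrt{2\log n/W_n}\right), & \text{if } m = 0;\\[2pt]
= e^{-\gamma}\varphi_1\!\left(\sqrt{2\log n/W_n}\right), & \text{if } m = 1;\\[2pt]
\to \infty, & \text{if } m \ge 2.
\end{cases}
\]

This proves (23), (24) and (25). It remains to consider more precisely the behavior of $\varphi_0(x)$ and $\varphi_1(x)$.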

Obviously, by definition, $\varphi_0(x)\in(0,1]$ and $\varphi_1(x)\in[1,\infty)$ because $\{x\}\in[0,1)$ for $x\in\mathbb{R}^+$. If $\{x\}=0$, then $\varphi_0(x)=1$; more generally,

\[
\varphi_0(x)\to
\begin{cases}
1, & \text{if } \{x\}\log x = o(1);\\
0, & \text{if } \{x\}\log x\to\infty.
\end{cases}
\]

On the other hand,

\[
\varphi_1(x)\to
\begin{cases}
1, & \text{if } (1-\{x\})\log x = o(1);\\
\infty, & \text{if } (1-\{x\})\log x\to\infty.
\end{cases}
\]

We now prove that

\[
r_n = 0 \quad\text{if and only if}\quad n = i^{i^2}\quad(i\ge2). \tag{26}
\]

First, if $n = i^{i^2}$, then $2\log n = 2i^2\log i$ and the positive solution to the equation (see (12))

\[
W_n e^{W_n} = 2i^2\log i
\]

is given by $W_n = 2\log i$, as can be easily checked. Thus

\[
\sqrt{\frac{2\log n}{W_n}} = i \qquad (i\ge2). \tag{27}
\]

Conversely, if the relation (27) holds, then the positive solution to the equations

\[
\frac{2\log n}{W_n} = i^2 \quad\text{and}\quad W_n e^{W_n} = 2\log n
\]

is given by $n = i^{i^2}$. This proves (26).
It follows particularly, by (19), that 

lim E[Mi,i_i 

This completes the proof of the theorem. ■ 

The function $d_0$ of $n$ on the right-hand side of (22) grows extremely slowly. Let $a_i := i^{i^2}$ with $a_1 := 2$. Then $d_0 = i+1$ for $a_i \le n < a_{i+1}$, which is small for almost all practical sizes of $n$:

\[
d_0 =
\begin{cases}
2, & \text{if } 2 \le n \le 15;\\
3, & \text{if } 16 \le n \le 19682;\\
4, & \text{if } 19683 \le n \le 4294967295;\\
5, & \text{if } 4294967296 \le n \le 2.98\cdots\times10^{17};\\
6, & \text{if } 2.98\cdots\times10^{17} \le n \le 1.03\cdots\times10^{28}.
\end{cases}
\]

This partly explains why the asymptotic vanishing property of $\mathbb{E}[M_{d,k}(n)]$ for large $n$ and fixed $d$ is "invisible" for moderate values of $n$.
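The slow growth of $d_0$ can be made concrete numerically. The sketch below is illustrative only: it assumes the reading $d_0 = \lfloor\sqrt{2\log n/W(2\log n)}\rfloor + 1$ of (22), uses a hand-written Newton iteration `lambert_w` for the principal branch of $W$ (a helper introduced here, not a library routine), and compares it with the integer characterization $d_0 = i+1$ for $i^{i^2} \le n < (i+1)^{(i+1)^2}$. The integer version avoids rounding problems exactly at the breakpoints $n = i^{i^2}$, where the floating-point square root may land just below an integer.

```python
import math


def lambert_w(x, tol=1e-14):
    """Principal branch W(x) for x >= 0: solve w * exp(w) = x by Newton's method."""
    w = math.log1p(x)  # starts to the right of the root, so Newton decreases monotonically
    for _ in range(100):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) <= tol * (1.0 + abs(w)):
            break
    return w


def d0_float(n):
    """d0 from the Lambert-W expression (unreliable exactly at n = i**(i*i))."""
    x = 2.0 * math.log(n)
    return math.floor(math.sqrt(x / lambert_w(x))) + 1


def d0_exact(n):
    """d0 = i + 1, where i is the largest integer with i**(i*i) <= n (n >= 2)."""
    i = 1
    while (i + 1) ** ((i + 1) ** 2) <= n:
        i += 1
    return i + 1


if __name__ == "__main__":
    for n in (15, 16, 1000, 19683, 10**6, 10**12):
        print(n, d0_exact(n))
```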

Note that we did not replace the Lambert-W function in (22) by its asymptotic expansion (13) in order to make the expression more transparent, the reason being that no matter how many terms of the asymptotic expansion of $W$ we use, the error of the resulting expression is never $o(1)$. This is because all terms in the expansion are of orders in powers of $\log\log n$ and $\log\log\log n$, and they are all much smaller than the $\log n$ in the numerator of the first term on the right-hand side of (22).

Extending the same analysis to other values of $k$ becomes more difficult and messy except for $k = 1$, for which we have

\[
\mathbb{E}[M_{d,1}(n)] = n\int_{[0,1]^d}(x_1\cdots x_d)^{n-1}\,\mathrm{d}\mathbf{x} = n^{1-d}.
\]

Note that this always tends to zero no matter how large the value of $d$ is.

On the other hand, for $1 < k < d-1$, we can derive the more precise estimate



E[Mrf,fc(^)] = oln [ expl-n V 



<jk<'i 



However, a more precise uniform asymptotic approximation (in $n$, $d$, and $k$) is less obvious, and describing the corresponding threshold phenomena, if any, for other values of $k$ also remains unclear. Intuitively, the asymptotic vanishing property is expected to hold as long as $k > d/2$, no matter whether $d$ is finite or growing with $n$, because the probability of a $k$-dominance for a random pair of points is larger than one half, meaning that it is less likely to find $k$-dominant skyline points in such a case.



8 Expected number of dominant cycles 

The asymptotic zero-infinity property can be viewed from another angle by examining the number of dominant cycles.

Definition. We say that $m$ points $\{\mathbf{p}_1,\dots,\mathbf{p}_m\}$ form a $k$-dominant cycle (of length $m$) if $\mathbf{p}_i$ $k$-dominates $\mathbf{p}_{i+1}$ for $i = 1,\dots,m-1$ and $\mathbf{p}_m$ $k$-dominates $\mathbf{p}_1$.

Roughly, the number of $k$-dominant cycles is inversely proportional to the number of $k$-dominant skylines. Note that by transitivity there is no cycle when $k = d$. Thus the number of cycles seems a better measure to clarify the structure of $k$-dominant skylines. However, the general configuration of the cycle structure is very complicated. We content ourselves in this section with the consideration of cycles of length $d$ when $k = d-1$.
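A small example may help to see how a cycle empties the $k$-dominant skyline. The sketch below assumes, as a working convention not restated in this section, that $\mathbf{p}$ $k$-dominates $\mathbf{q}$ when $\mathbf{p}$ is strictly larger than $\mathbf{q}$ in at least $k$ coordinates; the function names are hypothetical and the skyline is computed by brute force, not by the authors' algorithm.

```python
def k_dominates(p, q, k):
    """Assumed convention: p k-dominates q if p > q in at least k coordinates."""
    return sum(a > b for a, b in zip(p, q)) >= k


def is_k_dominant_cycle(points, k):
    """True if each point k-dominates the next one, cyclically."""
    m = len(points)
    return all(k_dominates(points[i], points[(i + 1) % m], k) for i in range(m))


def k_dominant_skyline(points, k):
    """Brute-force skyline: the points that no other point k-dominates."""
    return [p for p in points
            if not any(k_dominates(q, p, k) for q in points if q is not p)]


if __name__ == "__main__":
    # Three points in dimension d = 3 forming a (d-1)-dominant cycle of length d.
    cycle = [(3, 2, 1), (2, 1, 3), (1, 3, 2)]
    print(is_k_dominant_cycle(cycle, 2))   # every point is 2-dominated by its predecessor
    print(k_dominant_skyline(cycle, 2))    # hence the 2-dominant skyline is empty
    print(k_dominant_skyline(cycle, 3))    # no point 3-dominates another
```

In the example each coordinate has exactly one "ascent" along the cycle, and the three ascents occur at distinct positions, which is the configuration counted in the proof of Lemma 1.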

Lemma 1. Let $C_{n,d}$ denote the number of $(d-1)$-dominant cycles of length $d$ in a random sample of $n$ points uniformly and independently chosen from $[0,1]^d$. Then the expected value of $C_{n,d}$ satisfies

\[
\mathbb{E}[C_{n,d}] = \binom{n}{d}\frac{(d-1)!}{(d!)^{d-1}}. \tag{28}
\]

Proof. Since the total number of cycles of length $d$ is given by $\binom{n}{d}\frac{d!}{d}$, we see that

\[
\mathbb{E}[C_{n,d}] = \binom{n}{d}\frac{d!}{d}\;\mathbb{P}\bigl(\{\mathbf{p}_1,\dots,\mathbf{p}_d\}\text{ form a $(d-1)$-dominant cycle of length } d\bigr).
\]

Assume that $\{\mathbf{p}_1,\dots,\mathbf{p}_d\}$ form a $(d-1)$-dominant cycle of length $d$. Let

\[
\mathbf{p}_i = (p_{i,1},\dots,p_{i,d}) \qquad (i = 1,\dots,d).
\]

Then for each coordinate $j$, there exists an $\ell$ such that

\[
p_{1,j} > p_{2,j} > \cdots > p_{\ell,j},\quad p_{\ell,j} < p_{\ell+1,j},\quad p_{\ell+1,j} > \cdots > p_{d,j} > p_{1,j},
\]

and the $\ell$'s are all distinct ($d!$ cases). Thus the probability of the event that $\{\mathbf{p}_1,\dots,\mathbf{p}_d\}$ form a $(d-1)$-dominant cycle is given by

\[
\frac{d!}{(d!)^{d}},
\]

from which (28) follows. ■

In particular, we see that

\[
\mathbb{E}[C_{n,2}] = \frac12\binom{n}{2},
\]

which means that half of the pairs are cycles, rendering the 1-dominant skylines less likely to occur. The first few other $\mathbb{E}[C_{n,d}]$ are given by

\[
\{\mathbb{E}[C_{n,d}]\}_{d\ge3} = \left\{\frac{n(n-1)(n-2)}{108},\ \frac{n(n-1)(n-2)(n-3)}{55296},\ \frac{n(n-1)(n-2)(n-3)(n-4)}{1036800000},\ \frac{n(n-1)(n-2)(n-3)(n-4)(n-5)}{1160950579200000},\ \dots\right\}.
\]

We see that the denominator grows very fast and we expect another type of threshold phenomenon.
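The closed form behind these values is easy to evaluate exactly. The following sketch assumes the reading $\mathbb{E}[C_{n,d}] = \binom{n}{d}(d-1)!/(d!)^{d-1}$ of (28), equivalently $n(n-1)\cdots(n-d+1)/\bigl(d\,(d!)^{d-1}\bigr)$, and reproduces the denominators listed above; the function names are introduced here for illustration only.

```python
from fractions import Fraction
from math import comb, factorial


def expected_cycles(n, d):
    """E[C_{n,d}] = C(n, d) * (d-1)! / (d!)^(d-1), as an exact rational."""
    return Fraction(comb(n, d) * factorial(d - 1), factorial(d) ** (d - 1))


def cycle_denominator(d):
    """Denominator d * (d!)^(d-1) in n(n-1)...(n-d+1) / (d * (d!)^(d-1))."""
    return d * factorial(d) ** (d - 1)


if __name__ == "__main__":
    for d in range(2, 7):
        print(d, cycle_denominator(d), float(expected_cycles(10**4, d)))
```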
Let

\[
d_1 := \left\lfloor\frac{\log n}{W(e^{-1}\log n)} + \frac12\right\rfloor,
\]

and let $r_n$ denote the fractional part of $\frac{\log n}{W(e^{-1}\log n)} + \frac12$. Also let



vit) :- 



l + |log27r 
W + 1 



+ 



W(e-^logn) ' 2' 
W 



{\ogn){W + 1) 



t 



12Vr3 + (35 - 12 log 27r) ^ (34 - 24 log 27r)Ty + 23 + (log 27r)^ 



24{W + iy 

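where $t\in\mathbb{R}$ and $W$ represents $W(e^{-1}\log n)$. Note that $W$ is of order $\log\log n$.

Theorem 9. The expected number of $(d-1)$-dominant cycles of length $d$ satisfies

\[
\lim_{n\to\infty}\mathbb{E}[C_{n,d}] =
\begin{cases}
\infty, & \text{if } 2 \le d < d_1;\\
0, & \text{if } d > d_1.
\end{cases}
\]

When $d = d_1$, we can write $r_n = v(t)$; then

\[
\mathbb{E}[C_{n,d}]
\begin{cases}
\to 0, & \text{if } t\to-\infty;\\
\sim e^{t}, & \text{if } t = O(1);\\
\to\infty, & \text{if } t\to\infty.
\end{cases} \tag{29}
\]

Proof. Write

\[
d = d_1 - m = \frac{\log n}{W(e^{-1}\log n)} + \frac12 - v,
\]

where $v = m + r_n$. Then a straightforward calculation using (28) and Stirling's formula gives

\[
\frac1d\log\mathbb{E}[C_{n,d}] = v\bigl(W(e^{-1}\log n)+1\bigr) - 1 - \tfrac12\log2\pi + O\!\left(\frac{W(e^{-1}\log n)^2 + (v^2+1)\,W(e^{-1}\log n)}{\log n}\right).
\]

Thus $\mathbb{E}[C_{n,d}]\to\infty$ if $m \ge 1$ and $\mathbb{E}[C_{n,d}]\to0$ if $m \le -1$. When $m = 0$ ($v = r_n$), this asymptotic expansion is insufficient and we need more terms. If $v = r_n = v(t)$, then the same calculation as above gives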



E[C„,,] = eM 1 + O 



+ 1 
logn 



This implies (29). 
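Let

\[
a_i := \left\lfloor\left(\frac{i-\frac12}{e}\right)^{i-\frac12}\right\rfloor + 1 \qquad (i\ge1).
\]

Then

\[
d_1 = d_1(n) = i \quad\text{if } a_i \le n < a_{i+1}.
\]

The first few values of $a_i$ are given as follows.

  i      4    5    6     7      8       9        10        11         12
  a_i    3    10   49    290    2022    16165    145405    1453435    15982276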
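The tabulated breakpoints can be reproduced from the definition of $d_1$. The sketch below assumes $a_i = \lfloor((i-\tfrac12)/e)^{\,i-\frac12}\rfloor + 1$, which is the smallest integer $n$ with $\log n/W(e^{-1}\log n) + \tfrac12 \ge i$; this formula is a reconstruction that agrees with the table, and the helper names `a` and `d1` are introduced here for illustration.

```python
import math


def a(i):
    """Assumed breakpoint a_i = floor(((i - 1/2) / e) ** (i - 1/2)) + 1."""
    half = i - 0.5
    return math.floor((half / math.e) ** half) + 1


def d1(n):
    """d_1(n) = i for a_i <= n < a_{i+1}, found by scanning the breakpoints."""
    i = 1
    while a(i + 1) <= n:
        i += 1
    return i


if __name__ == "__main__":
    print([a(i) for i in range(4, 10)])
    print([(n, d1(n)) for n in (9, 10, 1000, 10**4, 10**5)])
```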
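9 A uniform lower bound for $\mathbb{E}[M_{d,k}(n)]$

The convergence rate in (1) is very slow if $d$ is large and $k$ is close to $d$. It is interesting to characterize the transition of $M_{d,k}(n)$ from zero to $n$ as $k$ increases under the condition that $d$ and $n$ are fixed. However, the exact characterization is not easy, so we derive instead a lower bound that provides a good approximation to the real transition.

Theorem 10 (Uniform lower bound in $d$, $k$ and $n$). Define

\[
\beta_{d,k} := 2^{-d}\sum_{k\le j\le d}\binom{d}{j}.
\]

Then, for $n \ge 1$ and $1 \le k \le d-1$,

\[
\mathbb{E}[M_{d,k}(n)] \ge n\,I_n(\beta_{d,k}), \tag{30}
\]

where

\[
I_n(x) := x\int_x^1 t^{-2}(1-t)^{n-1}\,\mathrm{d}t.
\]

Proof. Select two random points $\mathbf{x}$, $\mathbf{y}$ uniformly and independently in $[0,1]^d$. Obviously,

\[
\mathbb{P}(\mathbf{x}\ k\text{-dominates }\mathbf{y}) = \beta_{d,k}.
\]

On the other hand, by definition, $\mathbb{P}(\mathbf{x}\ k\text{-dominates }\mathbf{y}) = \int_{[0,1]^d}|B_k(\mathbf{x})|\,\mathrm{d}\mathbf{x}$. Thus

\[
\int_{[0,1]^d}|B_k(\mathbf{x})|\,\mathrm{d}\mathbf{x} = \beta_{d,k}.
\]

Let

\[
F(t) = \bigl|\{\mathbf{x}\in[0,1]^d : |B_k(\mathbf{x})| \le t\}\bigr|
\]

be the distribution function of $|B_k(\mathbf{x})|$. By Markov's inequality,

\[
t\bigl(1-F(t)\bigr) \le \int_{[0,1]^d}|B_k(\mathbf{x})|\,\mathrm{d}\mathbf{x} \qquad (t\in(0,1)).
\]

Thus $F(t) \ge 1 - \beta_{d,k}/t$. Define

\[
G(t) := \max\left\{1 - \frac{\beta_{d,k}}{t},\ 0\right\}.
\]

Then $F(t) \ge G(t)$. Now

\[
\mathbb{E}[M_{d,k}(n)] = n\int_{[0,1]^d}\bigl(1-|B_k(\mathbf{x})|\bigr)^{n-1}\,\mathrm{d}\mathbf{x} = n\int_0^1(1-t)^{n-1}\,\mathrm{d}F(t). \tag{31}
\]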
Figure 4: Simulation result of $\mathbb{E}[M_{d,k}(n)]$ and the lower bound (30) for $n = 1000$, $d = 100$ and $k$ from 50 to 100.



Since the integral on the right-hand side of (31) becomes smaller if the distribution function $F(t)$ is replaced by $G(t)$, we have

\[
\mathbb{E}[M_{d,k}(n)] \ge n\int_0^1(1-t)^{n-1}\,\mathrm{d}G(t),
\]

from which (30) follows. ■

A useful, convergent asymptotic expansion for $I_n(x)$, derived by successive integration by parts, is as follows:

\[
I_n(x) = \frac{(1-x)^n}{nx} - \frac{2(1-x)^{n+1}}{n(n+1)x^2} + \cdots,
\]

as long as $x \gg 1/n$. In particular, $I_n(x)\to0$ in this range of $x$. If $xn\to c > 0$, then

\[
I_n(x) \to c\int_c^\infty t^{-2}e^{-t}\,\mathrm{d}t,
\]

the latter tending to 1 as $c$ approaches zero.

We see that the transition of $I_n(x)$ from zero to one occurs at $x \asymp n^{-1}$ (meaning that $x$ is of order proportional to $n^{-1}$). In terms of $d$ and $k$, this arises when $d\to\infty$ and $\beta_{d,k} \asymp n^{-1}$. Now, by a known estimate for the binomial distribution (see [17] and the references cited there)

Pd,k X (2a - l)-^d-^/^2-^a-''\l - a)-^^-")^ 

when $k \ge d/2 + K\sqrt{d}$, where $\alpha := k/d$ and $K > 1$ is a constant. We deduce from this that the transition of $I_n(\beta_{d,k})$ from zero to one occurs at $c\log n$ for some $c\in(0,1)$. The exact location of this $c$ matters less since $I_n$ is simply a lower bound; see Figure 4.
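The lower bound (30) is straightforward to evaluate numerically. The sketch below is one possible way to do so, not the computation used for Figure 4: it assumes $\beta_{d,k} = 2^{-d}\sum_{k\le j\le d}\binom{d}{j}$ and $I_n(x) = x\int_x^1 t^{-2}(1-t)^{n-1}\,\mathrm{d}t$, and rewrites the integral with the substitution $t = xe^{-s}$ as $I_n(x) = \int_{\log x}^{0}(1-xe^{-s})^{n-1}e^{s}\,\mathrm{d}s$, which removes the $t^{-2}$ singularity for very small $x$ before applying Simpson's rule. The names `beta`, `I_n` and `lower_bound` are introduced here for illustration.

```python
import math


def beta(d, k):
    """P(x k-dominates y) for two uniform points: P(Binomial(d, 1/2) >= k)."""
    return sum(math.comb(d, j) for j in range(k, d + 1)) / 2 ** d


def I_n(n, x, panels=4000):
    """I_n(x) = x * int_x^1 t^(-2) (1-t)^(n-1) dt via Simpson's rule in s = log(x/t)."""
    if x >= 1.0:
        return 0.0
    lo, hi = math.log(x), 0.0
    h = (hi - lo) / panels  # panels must be even for Simpson's rule

    def g(s):
        base = max(0.0, 1.0 - x * math.exp(-s))  # clamp tiny negative rounding errors
        return base ** (n - 1) * math.exp(s)

    total = g(lo) + g(hi)
    for i in range(1, panels):
        total += (4 if i % 2 else 2) * g(lo + i * h)
    return total * h / 3.0


def lower_bound(n, d, k):
    """Right-hand side n * I_n(beta_{d,k}) of the lower bound (30)."""
    return n * I_n(n, beta(d, k))


if __name__ == "__main__":
    for k in range(50, 101, 5):
        print(k, round(lower_bound(1000, 100, k), 3))
```

With $n = 1000$ and $d = 100$ the printed values rise from essentially zero to essentially $n$ as $k$ goes from 50 to 100, which is the transition discussed above.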
10 Conclusions

While the notion of $k$-dominant skyline appeared as a natural means of resolving the abundance of skyline points, its use in diverse contexts has to be carefully considered, in view of the results we derived in this paper. We summarize our findings and highlight suggestions for possible practical uses.

The asymptotic results we derived in this paper are either of a vanishing type or of a blow-up nature; briefly, they are either zero or infinity when the sample size grows unboundedly, making the selection of representative points more subtle. The expected number of $k$-dominant skyline points approaches zero under any of the following situations.

• Hypercube: both d and k < d bounded; 

• Simplex: both d and k < d bounded; 

• Hypercube: extending the $k$-dominant skyline to the dominance by a cluster of $j$ points with both $d$ and $k$ bounded.

In all cases, zero appears as the limit when $n\to\infty$. However, for practical purposes, $n$ is always finite, and thus the above limit results become less useful from a computational point of view. One needs asymptotic estimates that are uniform in $d$, $k$ and $n$, but such results are often very difficult to obtain. The uniform asymptotic approximation (15) we obtained leads to several interesting consequences, including particularly the threshold phenomenon (23).

We conclude this paper by showing how the asymptotic results we derived above can be applied in more practical situations. Assume that our sample is of size, say, $n = 10^4$ or $n = 10^5$, and the dimensionality $d$ is in the range $\{4,5,6,7,8\}$ (smaller $d$ may result in more biased inferences, while larger $d$ will yield too many skyline points). We also assume that our data set is sufficiently random and can be modeled by the hypercube model. If our aim is to choose a reasonably small number of candidates for further decision making, then how can our asymptotic estimates help?

First, for this range of $n$ and $d$, the expected numbers of skyline points can be easily computed by the recurrence relation (see [5])

\[
\mu_{n,d} = \frac{1}{d-1}\sum_{1\le j<d}H_n^{(d-j)}\mu_{n,j} \qquad (d\ge2),
\]

where $\mu_{n,d} := \mathbb{E}[M_{d,d}(n)]$, $H_n^{(a)} := \sum_{1\le j\le n}j^{-a}$ are the harmonic numbers and $\mu_{n,1} := 1$, and are given approximately by

\[
\{164.7,\ 426.3,\ 902.7,\ 1633.1,\ 2603\} \qquad (n = 10^4;\ d = 4,5,6,7,8),
\]

and

\[
\{304.9,\ 955.8,\ 2432.1,\ 5239.4,\ 9845\} \qquad (n = 10^5;\ d = 4,5,6,7,8),
\]

which are often too many for further consideration. So we turn to the $(d-1)$-dominant skyline and estimate their numbers by our asymptotic approximations. However, both Theorems 6 and 7 have poor error terms, and a better numerical approximation to $\mathbb{E}[M_{d,d-1}(n)]$ for most moderate values of $n$ and $d$ is given by

0<j<d-2 ^-^^ 
We thus obtain, for example, the following numerical values

  E[M_{d,d-1}(10^4)]
  d                     4       5       6        7         8
  phi_d(n) - g_d(n)     0.61    5.06    24.85    88.90     243.96
  Monte Carlo           0.57    4.82    23.98    83.89     226.65

and

  E[M_{d,d-1}(10^5)]
  d                     4       5       6        7         8
  phi_d(n) - g_d(n)     0.31    3.69    24.94    115.31    404.7
  Monte Carlo           0.29    3.61    24.38    111.79    386.08


From these tables, one can choose a suitable $d$ according to practical needs. Here we also see the characteristic property of the skylines: either very few or very many points.
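The full-skyline counts quoted earlier in this section can be reproduced directly. The sketch below assumes the recurrence in the form $\mu_{n,d} = \frac{1}{d-1}\sum_{1\le j<d}H_n^{(d-j)}\mu_{n,j}$ with $\mu_{n,1} = 1$ and $H_n^{(a)} = \sum_{1\le j\le n}j^{-a}$; the function name `expected_skyline` is introduced here for illustration.

```python
import math


def expected_skyline(n, d):
    """mu_{n,d} = E[M_{d,d}(n)] from the harmonic-number recurrence."""
    # harmonic[a] holds the generalized harmonic number H_n^{(a)} for a = 1..d-1.
    harmonic = [0.0] * d
    for a in range(1, d):
        harmonic[a] = math.fsum(j ** -a for j in range(1, n + 1))
    mu = [0.0, 1.0]  # mu[1] = 1; index 0 is unused
    for m in range(2, d + 1):
        mu.append(sum(harmonic[m - j] * mu[j] for j in range(1, m)) / (m - 1))
    return mu[d]


if __name__ == "__main__":
    for n in (10**4, 10**5):
        print(n, [round(expected_skyline(n, d), 1) for d in range(4, 9)])
```

For $d = 2$ the recurrence reduces to $\mu_{n,2} = H_n$, the classical expected number of maxima in the plane.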

Our Monte Carlo simulations are carried out by a three-phase algorithm (extending our two-phase maxima-finding one in [12]) for finding the $k$-dominant skylines. Briefly, the first two phases are modified from the algorithms presented in [12] and the last phase removes all cycles.
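Appendix A. Error analysis: $d \le \sqrt{\frac{2\log n}{W(2\log n)+K}}$

Recall that $N := n^{\frac{1}{d-1}}$ and consider the integral

\[
f_d(n) = \left(\int_{[0,\infty)^d} - \int_{[0,N]^d}\right)e^{-y_1\cdots y_d\left(\frac{1}{y_1}+\cdots+\frac{1}{y_d}\right)}\,\mathrm{d}\mathbf{y} = \sum_{1\le j\le d}\binom{d}{j}\phi_{d,j}(n),
\]

where

\[
\phi_{d,j}(n) := \int_{[0,N]^{d-j}\times(N,\infty)^j}e^{-y_1\cdots y_d\left(\frac{1}{y_1}+\cdots+\frac{1}{y_d}\right)}\,\mathrm{d}\mathbf{y}. \tag{32}
\]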

So our (t)d{'n) = (^)'^ corresponds to (t>d,o{n); see (16). 
Proposition 1. Letd>3 satisfies -W{2 log n) —7- oo. Then 

fd{n) = O (0rf(n)rfiV-^) , (33) 

uniformly in d. 

Proof. We first prove that uniformly for I < j < d, 

<t>dM = O [t {^y-' N-^) . (34) 

Consider first the range $1 \le j \le d-2$. By extending the integration ranges and then carrying out the changes of variables $y_i \mapsto N v_{d-i+1}$ for $d-j+1 \le i \le d$, we obtain the bounds

0,,(n) = N^ [ ! e-""^-^^^^-"-(-"-"^"-^"-"^)dydv 



(l,oo): 



N^vi-v-jyi-'-yci-, ( — H 1 —] 



d-j 
By the change of variables $y_j \mapsto A^{-\frac{1}{d-1}}x_j$ for $1 \le j \le d$ we have, for $A > 0$,



It follows that 

d-j 



\ d-l-j j 3 I , ,1 1 

d-l-j i(i,ooy 



_J_ 



= {d-i-jy-'r (^) ^'n- 



d-l-j 



uniformly for $1 \le j \le d-2$. The remaining two cases $j = d-1, d$ are much smaller; we start with $\phi_{d,d}(n)$. By the same analysis used above, we have



'(W,oo)rf 



I 



(iV,oo)<* 



By the inequality 



< / 7 ^dx. 



/■oo 

/ r"e-^*dt < A-^A^-"e-^^ (a > 0, A > 0), 



we obtain 



-N^xi---Xd-2l —-\ 1 , 

< • • • 

< JV-2-4--2('i-3) / ^ ^ 



Thus 



(35) 



^ ^_(d_2)(d-3) / (w-27V)dw 

J2N 

= O (2-<^n-'^+^+^e-^") . (36) 



31 



Finally, 



n) < I y ^ dx 

V,oo)-^ X,---Xa-l [-tl'r--- + ^^ 

< / -^dx. 



by the inequality of arithmetic and geometric means

\[
\frac{1}{d-1}\left(\frac{1}{x_1}+\cdots+\frac{1}{x_{d-1}}\right) \ge (x_1\cdots x_{d-1})^{-\frac{1}{d-1}}.
\]

Applying successively the inequality (35), we obtain



< ■ 

j\^-{dP-2d+2) 

d- 1 



< : e-^"\ 



It follows that 



O (^rf-^n-^+^-^e-") . (37) 



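We see that both $\phi_{d,d-1}(n)$ and $\phi_{d,d}(n)$ are much smaller than the right-hand side of (34).
The remaining case is when $d = 2$. Obviously,

\[
\phi_{2,1}(n) \le \int_0^\infty\!\!\int_N^\infty e^{-y_1-y_2}\,\mathrm{d}y_2\,\mathrm{d}y_1 = e^{-N}.
\]

The upper bound (33) then follows from summing $\phi_{d,j}(n)$ for $j$ from 1 to $d$ using (34), since $dN^{-\frac{1}{d-2}}\to0$ for $d$ in the range (14).

It remains to estimate $R_d(n)$, which can be proved to be bounded above by

\[
R_d(n) = O\!\left(\frac{d}{N}\int_{[0,\infty)^d}y_1\cdots y_d\,e^{-y_1\cdots y_d\left(\frac{1}{y_1}+\cdots+\frac{1}{y_d}\right)}\,\mathrm{d}\mathbf{y}\right)
= O\!\left(\frac{d}{(d-1)N}\,\Gamma\!\left(\frac{2}{d-1}\right)^{d}\right);
\]

this proves (17).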
Appendix B. Proof of Theorem 7

We prove Theorem 7 in this Appendix. Our method of proof consists in a finer evaluation of the integrals $\phi_{d,j}(n)$, leading to a more precise asymptotic approximation to $f_d(n)$.

Proposition 2. Uniformly for d in the range (18) 

1-e-P 1 „ / 1 ^ 



where p :- 



d 



Proof. Consider again (32) and start with the changes of variables yc i-> Nvd-t+i for c? — j + 1 < 

ct>dAn) = N^ [ [ e"'"'^^^^'^-'''-^(-^-^^^^''-^^)dydv, 



< {1,00)3 J [0,Ar]d-j 

where AArj (v) := NHi ■ ■ - Vj. Then we carry out the change of variables 

ye ^ ^Nji^r^^Xi {I < i < d- j), 

and obtain 

4>d,j{n) = ipd,j{n) + ujd,j{n), 

where 

1 1 / -xi---Xd 



with 

d-l 1 1 

and the error introduced is bounded above by 



'(l,oo)J 



JL,...,_^ / _L I ... I -L 



X / e 'V^i' '^'^-.y I e V"i ' ' "^7 - 1 I dxdv 

lO,No]'i-3 



'(l,oo)J ^ 



2j 



e "'''-J/ xi ■ • ■ Xd_,dxdv 
Thus the total contribution of $\omega_{d,j}(n)$ to $f_d(n)$ is bounded above by



(39) 

2 

d-l-i 



which will be seen to be of a smaller order.

The recurrence relation. Now



(l,oo)J 

-1- 



d-j 



- N d-i-j / {vi--- Vj) ''-i-^ fd-j{nvi ■ ■ ■ Vj)dv 

J(l,oo)J 



d-i-j 



d-i-j / {vi - ■■ Vj) ''-i-^ fd-j{nvi ■ ■ ■ Vj)dv. 

i(l,oo)J 



So we get the following recurrence relation. 
Lemma 2. The integrals fd{n) satisfy 
fd{n) = gd{n) + hd{n) + r]d{n) 



t-^_oVjy 7(1,00). 



l<j<d-2 



(40) 

/or (i > 3, w«Y/z the initial condition 

/2(n)=2e-"-e-^ 

w/iere hd{n) is given in (39), 

i<i<<i-2 ^''^ 
andr]d{n) := 0d_d_i(n) + ipdA^)- 
Note that, by (36) and (37), 



34 



Also, by the change of variables 1 1— )■ t>i ■ ■ ■ t>j, we have 
fd{n) = gd{n) + hd{n) + 7]d{n) 

which is easier to use with symbolic computation software.
We then obtain, for example, 

fz{n) = 3n^^ + O ^n^i j , 

3 1 / 2 \ 

/4(n) = An^n^e + O in~3 \ ^ 

^ , . SOtT^ _j_ 3 _i ^ / 

^ ^ 9r(|)' V ; 

But the expressions soon become too messy. 

Asymptotic estimate for $g_d(n)$. We derive first a uniform asymptotic approximation to $g_d(n)$, which will be needed later. We focus on the case when $d$ tends to infinity with $n$.

Lemma 3.1fd satisfies (18), then 

uniformly in d. 
Proof. First, we have 

d-l 

2 

uniformly for j = o{d^). Summing over all j gives (41). Here the errors omitted are estimated 
by the inequalities 

e)-o(fe-i), 

r(i)<x, (x^>i) 
for $1 \le j \le d-2$, and we see that the contribution of terms in $g_d(n)$ with indices larger than,
say jo := L'^^J are bounded above by 



^ ^ j>jo ■' J 



(d-i)(d-i-i) 



j>jo 

= o 



j>jo 

1 ^ / 1 Y p^'' 



o I r , , 

d — 1 \d — 1 J jo' 



Thus for d in the range (18) 

jo log p — log jo! = |(i5 log d — d~5 logn + rfs + 0(\ogd) 

< - - i2tj (logn)iT)(loglogn)iT)(l + o(l)) 



40 

so that 



< -|5(logn)io(loglogn)io(l + o(l)), 



Jo! V / ' 



and the sum of these terms is asymptotically negligible. The errors X]j>jo estimated 
similarly. ■ 

Iteration of the $\Phi$-operator. To derive a similar estimate for $f_d(n)$, we define the operator

\[
\Phi[f_d](n) := \sum_{1\le j\le d-2}\binom{d}{j}(-1)^{j}\,n^{-\frac{j}{(d-1)(d-1-j)}}\int_{(1,\infty)^j}(v_1\cdots v_j)^{-\frac{d-j}{d-1-j}}\,f_{d-j}(n v_1\cdots v_j)\,\mathrm{d}\mathbf{v}.
\]

By iterating the recurrence (40), we obtain

\[
f_d = g_d + h_d + \eta_d + \sum_{m\ge1}\Phi^m[g_d + h_d + \eta_d],
\]

where $\Phi^j[f_d] = \Phi\bigl[\Phi^{j-1}[f_d]\bigr]$ denotes the $j$-th iterate of the $\Phi$-operator.

Surprisingly, despite the complicated forms of the partial sums, each $\Phi^m[g_d]$ can be explicitly evaluated and differs from $g_d$ only by a single term.

Lemma 4. For any m > 

<^'^[9d]{n)= (^)(-l)'"'(^-l-^)'"'r(^)'^-^^-n^a^(£), (42) 

where am{^) is always positive and defined by 



■ I I • _» \Jly ' ' ' iJm+l/ 



JlH hjm+l = 

il,---Jm + l>l 
Note that

\ m+l 



r 

l<r<m+l 



Proof. By definition and by rearranging the terms 



Substituting this expression into the ^-operator, we see that 



X 



l<e<d-j-2 



Then 



= E )(-ir'-''"-'ra)'""--* E 

= E (')(-i)"(<i-i-f)"r(;a3,)""'^--^(2'-2). 

By repeating the same analysis and induction, we prove (42). ■ 
Corollary 3. If d satisfies (IS), then 

^'^[9d]{n) ~ (-l)™^r (^)'^ (1 - e-")""^' (m = 0, 1, . . . ). 
Summing over all $0 \le m \le d-2$, we deduce (38); only the error estimates remain.

Error analysis The consideration of is similar and we obtain 

$-N(n)< Yl (^)2-v-i-^r-^r(^)'^-^n^-^a:„(£) 

m<e<d~2 ^ ^ 
where



• , , ■ \Jly ' ' ' iJm+lJ 



jlH — him+l=£ 
ilv.im + l>l 



d.[/]ze' (e^-l)™ 

0<r<m ^ 



Thus, with 



d 



which is always < log 2 when d satisfies (18), we then have 



d((i-l)2<* 



\ 0<r<m ^ ^ / 

O (poe"" ((e''« - 1)™-^ ((m + 1)6"° - 1))) • 



Now 



Y ((s-l)"-^((m + l)x-l)) =0(rf' 

0<m<d-2 

whenever < x < 2. It follows that 

, d 



0<m<d-2 

which holds uniformly as long as e''" < 2. This is how the upper limit of d in (18) arises. 
In such a case. 



0<m<d-2 



We consider now $\Phi^m[\eta_d]$. Note that an exponentially small term remains exponentially small under the $\Phi$-operator because



J(i,ooy 



'(1,oo):J 

So all terms of the form $\Phi^m[\eta_d]$ are asymptotically negligible, and we then deduce (38).
More calculations give 

fdin) _ 1 - e-^ ^ pe-P f2p-l + {p + \)e~P 



±-V(J-Y 2-e-P (2-e- 



p)^ \ d 



2{p-3) + {p + 3)e'p \ ^ ^fpe-P, 3 , ,,f-, , l^S^ ^ 



logn +0{^{p' + l) 1 + 



d^ ^ J V V 

Note that the range (14) arises because we had to drop factors of the form $(-1)^j$ in estimating the sum of $h_d(n)$. With a more careful analysis along the same inductive line, we can extend the range of uniformity of (38).
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